Gen2 Database Architecture: Principles and Decisions
This article covers the architectural decisions behind Service Farm Gen2's database layer: the transition from a Neo4j-centric system to a PostgreSQL-first design, along with the rationale, implementation, and consequences of those choices. It addresses database allocation, graph relationship modeling, schema ownership boundaries, and vector embedding support within Service Farm Gen2.
Context: The Need for Architectural Shift
Service Farm Gen2 is migrating its core database infrastructure from a Neo4j-centric architecture to one driven primarily by PostgreSQL. The transition requires clear architectural principles in several areas: database allocation (choosing between PostgreSQL and Neo4j), modeling graph relationships with Wikidata integration, defining schema ownership boundaries between the service_farm and webapp components, and supporting vector embeddings. Understanding the context behind these decisions is key to grasping the project's overall direction.
The way we store, manage, and access data shapes the performance, scalability, and maintainability of our systems. Service Farm Gen2's initial architecture leaned heavily on Neo4j, a graph database, to manage relationships between data entities. As the application evolved, limitations became apparent: Neo4j, while powerful for graph-specific operations, added operational overhead and delivered weaker query performance for non-graph data. This led the architecture team to re-evaluate its options, and PostgreSQL, with its robust SQL capabilities and extensions, emerged as a strong contender.
The decision to migrate was not taken lightly. It involved a thorough assessment of the existing architecture, identification of pain points, and evaluation of alternatives, weighing the complexity of managing multiple database systems, the performance characteristics of each technology, and long-term scalability. The goal was a database architecture that met Service Farm Gen2's current needs while providing a solid foundation for future growth.
A primary driver was simpler operations. Running Neo4j and PostgreSQL side by side meant duplicated backups, scaling, and monitoring; consolidating to a single database system streamlines all of these. The team also sought better performance for queries that do not involve complex graph traversals, which standard SQL in PostgreSQL handles in a familiar and efficient way.
Decision 1: PostgreSQL-First, No Neo4j
The Decision
The core architectural decision is to use PostgreSQL exclusively for all data storage within Service Farm Gen2, deprecating Neo4j entirely.
Rationale Behind the Choice
This decision aligns with the goals set out in IMPLEMENTATION_BASIS.md: "simpler ops, better query performance, standard SQL." Consolidating to a single database system reduces operational complexity across backups, scaling, and monitoring, and a single source of truth simplifies the architecture and improves data consistency.
Modern PostgreSQL also supports recursive Common Table Expressions (CTEs), which handle graph traversals well enough to remove the need for a dedicated graph database, and with it the data duplication and synchronization between two systems.
Centralizing storage lets us apply PostgreSQL's feature set to both relational and graph data: advanced indexing, query optimization, and data integrity constraints. Standard SQL gives developers a familiar, powerful query language, and PostgreSQL's ecosystem of tools and extensions, including pgvector (discussed later in this article), enables features such as semantic similarity search over vector embeddings.
Implementation Details
To illustrate the practicality of this decision, consider the following SQL code snippet that demonstrates how graph edges can be stored and traversed within PostgreSQL:
-- Graph edges stored in PostgreSQL
CREATE TABLE core.edges (
id UUID PRIMARY KEY,
from_entity_id UUID REFERENCES core.entities(id),
to_entity_id UUID REFERENCES core.entities(id),
edge_type VARCHAR(50) NOT NULL,
confidence FLOAT,
metadata JSONB,
first_seen TIMESTAMPTZ DEFAULT NOW(),
last_seen TIMESTAMPTZ DEFAULT NOW()
);
-- Graph traversal via recursive CTE
WITH RECURSIVE entity_network AS (
SELECT e.* FROM core.entities e WHERE id = $start_entity
UNION
SELECT e.* FROM core.entities e
JOIN core.edges eg ON eg.to_entity_id = e.id
JOIN entity_network en ON eg.from_entity_id = en.id
WHERE eg.confidence > 0.7
)
SELECT * FROM entity_network;
This example shows graph edges represented as rows in the core.edges table, with foreign keys linking to entities in core.entities. The recursive CTE then traverses the graph, querying relationships between entities; using UNION (rather than UNION ALL) deduplicates rows, which also prevents the recursion from looping forever on cycles.
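For deeper traversals it is often useful to bound the search and track the visited path explicitly. The following is a minimal sketch against the same core.edges schema; the depth cap of 3 and the 0.7 confidence threshold are illustrative assumptions, not fixed parameters:
-- Bounded traversal: track depth and the visited path to stop cycles early
WITH RECURSIVE entity_network AS (
    SELECT e.id, e.canonical_name, 0 AS depth, ARRAY[e.id] AS path
    FROM core.entities e WHERE e.id = $start_entity
    UNION ALL
    SELECT e.id, e.canonical_name, en.depth + 1, en.path || e.id
    FROM core.entities e
    JOIN core.edges eg ON eg.to_entity_id = e.id
    JOIN entity_network en ON eg.from_entity_id = en.id
    WHERE eg.confidence > 0.7
      AND en.depth < 3                   -- cap traversal depth
      AND NOT (e.id = ANY(en.path))      -- skip entities already on the path
)
SELECT DISTINCT ON (id) id, canonical_name, depth
FROM entity_network
ORDER BY id, depth;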
Decision 2: Hybrid Wikidata Property Integration
The Decision
The chosen approach for integrating Wikidata properties is a hybrid model. This involves utilizing simplified edge types that draw inspiration from Wikidata P-numbers, while also incorporating optional storage of P-numbers for data enrichment purposes.
Rationale Behind the Choice
Wikidata, a vast and collaboratively edited knowledge base, defines an extensive array of properties (over 9,000) identified by P-numbers. Direct one-to-one mapping of these properties is excessively granular for our use case and awkward to query: properties such as P276 (location), P131 (located in the administrative territorial entity), and P17 (country) all relate to the concept of location, which makes queries against raw P-numbers cumbersome. Service Farm Gen2 also needs custom relationships (e.g., MENTIONED_IN, CO_OCCURS_WITH) that Wikidata does not define natively. At the same time, Wikidata's authority remains highly valuable for cross-language linking and external validation.
The hybrid approach strikes a balance: domain-specific edge types keep our queries simple and our data model manageable, while optional P-number storage retains the link to Wikidata's authoritative data.
Implementation Details
The following SQL code demonstrates how this hybrid approach is implemented:
CREATE TABLE core.edges (
id UUID PRIMARY KEY,
from_entity_id UUID,
to_entity_id UUID,
-- Our simplified ontology (queryable, fast)
edge_type VARCHAR(50) NOT NULL, -- LOCATED_IN, EMPLOYED_BY, etc.
-- Link to Wikidata properties (optional, for enrichment)
wikidata_properties JSONB, -- {"P276": 0.95, "P131": 0.85}
confidence FLOAT,
source VARCHAR(20), -- 'wikidata', 'extracted', 'user'
metadata JSONB,
first_seen TIMESTAMPTZ DEFAULT NOW(),
last_seen TIMESTAMPTZ DEFAULT NOW()
);
-- Edge types inspired by Wikidata but domain-specific
-- (the enum documents the allowed values; the table above stores edge_type as VARCHAR)
CREATE TYPE edge_type AS ENUM (
-- Location relationships (P276, P131, P17 → simplified)
'LOCATED_IN',
'BORDERS',
-- Organizational (P108, P463, P361)
'EMPLOYED_BY',
'MEMBER_OF',
'PART_OF',
-- Events (P1344, P828, P156)
'PARTICIPATED_IN',
'CAUSED_BY',
'FOLLOWED_BY',
-- Our custom (news-specific)
'MENTIONED_IN',
'CO_OCCURS_WITH'
);
Here, the edge_type column represents our simplified ontology, allowing for efficient querying. The wikidata_properties column, a JSONB type, provides optional storage for Wikidata P-numbers, enabling enrichment. This allows us to store the specific Wikidata properties that support the relationship, along with a confidence score. This approach offers several benefits:
- Simple queries using our custom edge types.
- Authoritative enrichment via Wikidata P-numbers.
- Ability to synchronize and update data from Wikidata when properties exist.
- Future-proof design: P-number mappings can be added without schema changes.
The following examples illustrate how this hybrid approach is used in practice:
-- Query: Find all locations for an entity
SELECT e.canonical_name
FROM core.edges eg
JOIN core.entities e ON e.id = eg.to_entity_id
WHERE eg.from_entity_id = $entity_id
AND eg.edge_type = 'LOCATED_IN';
-- Enrichment: Store Wikidata evidence
UPDATE core.edges
SET wikidata_properties = '{"P276": 0.95}'
WHERE id = $edge_id;
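Because wikidata_properties is JSONB, the stored enrichment remains queryable with standard PostgreSQL JSONB operators. A brief sketch:
-- Find edges backed by a specific Wikidata property
SELECT id, edge_type, wikidata_properties
FROM core.edges
WHERE wikidata_properties ? 'P276';
-- Find edges whose P276 evidence is strong
SELECT id, edge_type
FROM core.edges
WHERE (wikidata_properties ->> 'P276')::FLOAT > 0.9;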
Decision 3: pgvector for Semantic Embeddings
The Decision
To enhance semantic similarity operations, the decision is to incorporate the pgvector extension within PostgreSQL.
Rationale Behind the Choice
Semantic embeddings play a pivotal role in several functionalities: semantic phase detection, entity disambiguation, claim clustering for event formation, and cross-language entity linking. The pgvector extension provides the infrastructure to store and query vector embeddings efficiently inside PostgreSQL, so complex semantic analysis happens directly in the database without external systems or data transfers.
pgvector is crucial for implementing semantic phase detection, as outlined in IMPLEMENTATION_BASIS.md. By representing text and entities as vectors, we can measure their semantic similarity and identify related concepts. This is particularly useful for disambiguating entities that share a name but differ in meaning: for instance, "Apple" the company versus "apple" the fruit can be distinguished by comparing their vector embeddings. Vector embeddings also let us cluster related claims and events, providing a comprehensive view of the information landscape.
Cross-language entity linking is another significant benefit. By embedding entities from different languages into the same vector space, we can identify equivalent concepts and relationships across languages.
Implementation Details
The implementation involves the following steps:
CREATE EXTENSION IF NOT EXISTS vector;
-- Add embeddings to core tables
ALTER TABLE core.entities ADD COLUMN embedding vector(1536);
ALTER TABLE core.claims ADD COLUMN embedding vector(1536);
ALTER TABLE core.events ADD COLUMN embedding vector(1536);
-- Vector similarity indexes
CREATE INDEX ON core.entities USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX ON core.claims USING ivfflat (embedding vector_cosine_ops);
-- Query: Find semantically similar entities
SELECT canonical_name,
embedding <=> $query_embedding AS distance
FROM core.entities
ORDER BY distance
LIMIT 10;
The CREATE EXTENSION IF NOT EXISTS vector; statement installs the pgvector extension if it is not already present. We then add an embedding column of type vector(1536) to the core.entities, core.claims, and core.events tables; the 1536 indicates each embedding has 1536 dimensions.
To accelerate similarity searches, we create ivfflat indexes on the embedding columns. ivfflat performs approximate nearest-neighbor search, trading a small amount of recall for much faster lookups, and the vector_cosine_ops operator class selects cosine distance as the metric. The final query finds semantically similar entities with the <=> operator, which computes the cosine distance between two vectors; ordering by distance and limiting the result set efficiently retrieves the closest matches.
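Because embeddings live alongside relational columns, similarity search composes with ordinary filters in a single query. As a sketch of the entity disambiguation case mentioned above, given a context embedding for a mention of "Apple" (the ILIKE pattern and the 0.5 distance cutoff are illustrative assumptions):
-- Disambiguation sketch: among entities sharing a surface name,
-- pick the one whose embedding best matches the mention's context
SELECT id, canonical_name,
       embedding <=> $context_embedding AS distance
FROM core.entities
WHERE canonical_name ILIKE 'apple%'
  AND (embedding <=> $context_embedding) < 0.5
ORDER BY distance
LIMIT 1;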
Decision 4: Schema Ownership (service_farm vs webapp)
The Decision
To maintain clear separation of concerns and promote independent development, schema ownership is divided as follows: service_farm owns the core.* schema, while webapp owns the app.* and community.* schemas.
Rationale Behind the Choice
This division enforces a clear separation of concerns between the knowledge graph (managed by service_farm) and the user experience (managed by webapp), and it allows the two systems to be deployed and versioned independently: each team can develop and ship without being tightly coupled to the other. It also fosters an API-first mindset, since well-defined interfaces between the systems become necessary, which promotes modularity and keeps future microservice splits open.
Clearly defined boundaries between the service_farm and webapp schemas minimize the risk of unintended dependencies and conflicts, so each team can iterate quickly and confidently, knowing its changes will not have side effects on the other system. The separation also simplifies testing: each team can focus on its own components without contending with the complexities of the other system.
Implementation Details
The following SQL code illustrates the schema ownership structure:
service_farm owns:
CREATE SCHEMA core;
CREATE TABLE core.pages (...); -- Artifacts
CREATE TABLE core.entities (...); -- Knowledge graph nodes
CREATE TABLE core.events (...); -- Factual happenings
CREATE TABLE core.edges (...); -- Relationships
CREATE TABLE core.claims (...); -- Atomic facts
service_farm provides interface:
CREATE SCHEMA bridge;
CREATE TABLE bridge.artifact_metadata (
artifact_id UUID PRIMARY KEY REFERENCES core.pages(id),
submitted_by_id UUID, -- Soft reference to webapp's users
submission_source VARCHAR(50),
user_metadata JSONB,
submitted_at TIMESTAMPTZ DEFAULT NOW()
);
webapp owns (separate schema, possibly separate repo):
CREATE SCHEMA app;
CREATE TABLE app.users (...);
CREATE TABLE app.user_artifacts (
user_id UUID REFERENCES app.users(id),
artifact_id UUID, -- Soft reference to core.pages
is_bookmarked BOOLEAN,
user_notes TEXT
);
CREATE TABLE app.stories (...);
CREATE TABLE app.votes (...);
CREATE TABLE app.disputes (...);
The service_farm team owns the core schema, which contains the core knowledge graph data. The service_farm team also provides a bridge schema, which acts as an interface between the service_farm and webapp systems. This schema contains tables that facilitate communication and data sharing between the two systems. The webapp team owns the app schema, which contains data related to user accounts, user interactions, and community features. This schema may reside in a separate database or repository, depending on the specific deployment requirements.
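These boundaries can also be enforced at the database level with roles and grants. A minimal sketch; the role names service_farm_rw and webapp_rw are hypothetical:
-- Hypothetical roles; names are illustrative
CREATE ROLE service_farm_rw;
CREATE ROLE webapp_rw;
-- service_farm has full access to the schemas it owns
GRANT ALL ON SCHEMA core, bridge TO service_farm_rw;
GRANT ALL ON ALL TABLES IN SCHEMA core, bridge TO service_farm_rw;
-- webapp owns app.*, reads core.*, and writes only the bridge table
GRANT ALL ON SCHEMA app TO webapp_rw;
GRANT ALL ON ALL TABLES IN SCHEMA app TO webapp_rw;
GRANT USAGE ON SCHEMA core, bridge TO webapp_rw;
GRANT SELECT ON ALL TABLES IN SCHEMA core TO webapp_rw;
GRANT SELECT, INSERT ON bridge.artifact_metadata TO webapp_rw;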
This clear separation offers several benefits:
- service_farm can migrate the core schema independently.
- webapp can evolve user features without impacting the service_farm code.
- Testing isolation is simplified, allowing focused testing of the knowledge graph without authentication complexities.
- Future microservice splits are facilitated, as schemas can be moved to separate databases with minimal disruption.
Consequences of the Decisions
Positive Outcomes
The architectural decisions outlined above yield several positive consequences:
- Simpler operations: Consolidating to a single PostgreSQL instance reduces operational overhead compared to managing both PostgreSQL and Neo4j.
- Standard tooling: Utilizing SQL for graph queries provides a familiar and powerful query language compared to Cypher.
- Better performance: Recursive CTEs in modern PostgreSQL offer efficient graph traversal capabilities.
- Clear boundaries: Schema ownership separation prevents conflicts between service_farm and webapp development efforts.
- Semantic power: pgvector enables advanced NLP features such as semantic similarity search.
- Wikidata integration: The hybrid approach balances simplicity and authority in integrating Wikidata properties.
Potential Negative Impacts
Despite the numerous benefits, there are potential negative impacts to consider:
- Loss of native graph visualization: The transition away from Neo4j means losing its built-in browser-based graph explorer.
- Learning curve: The team must invest time in learning recursive CTEs as a substitute for Cypher.
- Coordination overhead: service_farm and webapp teams must collaborate closely to define and maintain the bridge interface.
Mitigation Strategies
To address these potential negative impacts, the following mitigation strategies will be implemented:
- Utilize PostgreSQL visualization tools such as pgAdmin or Metabase for graph exploration.
- Document common graph query patterns using recursive CTE examples to aid in learning (one such pattern is sketched after this list).
- Establish a clear API contract between service_farm and webapp to minimize coordination overhead.
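As an example of a documented pattern, here is a path-finding sketch between two entities; the depth bound of 6 is an illustrative assumption:
-- Pattern: reconstruct a relationship path between two entities
WITH RECURSIVE paths AS (
    SELECT eg.to_entity_id AS entity_id,
           ARRAY[eg.from_entity_id, eg.to_entity_id] AS path
    FROM core.edges eg
    WHERE eg.from_entity_id = $start_entity
    UNION ALL
    SELECT eg.to_entity_id, p.path || eg.to_entity_id
    FROM core.edges eg
    JOIN paths p ON eg.from_entity_id = p.entity_id
    WHERE NOT (eg.to_entity_id = ANY(p.path))  -- prevent cycles
      AND array_length(p.path, 1) < 6          -- bound search depth
)
SELECT path FROM paths WHERE entity_id = $target_entity LIMIT 1;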
Related Resources
For further information, please refer to the following resources:
- IMPLEMENTATION_BASIS.md - Gen2 architecture overview
- docs/ARCHITECTURE.md - Design philosophy and principles
- docs/ARCHITECTURE_DIAGRAMS.md - Visual architecture
Next Steps
The next steps in this architectural evolution include:
- Schema design: Develop a comprehensive Gen2 schema based on the principles outlined in this document.
- Migration plan: Define a detailed plan for migrating data from Gen1 to Gen2 for core tables.
- API contract: Document the interface between service_farm and webapp to ensure seamless communication.
- Proof of concept: Conduct performance testing of recursive CTEs compared to Neo4j for common queries.
In conclusion, the architectural decisions outlined in this document represent a significant step forward for Service Farm Gen2. By embracing PostgreSQL as the primary database, integrating Wikidata properties in a hybrid manner, leveraging pgvector for semantic embeddings, and establishing clear schema ownership, we are building a robust, scalable, and future-proof system.