Data Engineering in the Era of Generative AI

Pipelines, governance, and quality remain the bedrock—while embeddings, RAG, and agent workloads add new surfaces for the modern data platform.

CognitiveBricks | November 18, 2024 | 15 min read

What changes—and what does not

Batch and streaming ETL, dimensional modeling, and data contracts still anchor enterprise analytics. What changes is the consumer set: retrieval systems, fine-tuning datasets, and offline evaluation sets need curated text, multimodal assets, and often smaller but higher-signal slices than a warehouse-wide dump. The warehouse remains the system of record for many metrics; the knowledge corpus becomes a parallel product with its own lifecycle.

If you only bolt a vector index onto messy SharePoint exports, agents will confidently cite stale policy PDFs and duplicate paragraphs. The engineering problem is the same as always: garbage in, brittle automation out—only the blast radius is larger because user-facing answers sound authoritative.

[Diagram: sources flow to a warehouse and a document lake, then through a chunk-and-embed pipeline to a vector store, with unified lineage feeding BI, RAG, and agents.]

Figure 1. Extend the platform: structured and unstructured paths converge under one lineage story so RAG and agents inherit warehouse-grade governance.

Platform extensions for gen-AI workloads

Most mature stacks add four capabilities. None replaces core ingestion; each demands the same operational discipline as your critical pipelines.

Document ingestion with OCR where needed, language detection, deduplication (shingle hashes, simhash, or embedding near-dupe detection), and PII handling aligned to legal review—not a regex-only afterthought.
Chunking strategies tied to domain structure: headings for policies, slides for decks, API sections for OpenAPI, tickets for ITSM exports. Avoid arbitrary 512-token windows unless you have measured that they retrieve as well as human-drawn chunk boundaries.
Vector stores as governed assets: row-level security mapped from source ACLs, retention tied to records management, and re-embedding jobs scheduled when embedding models or parsers change.
Prompt and parser versioning treated like feature definitions: hash configs in metadata, pin models, and support A/B or shadow retrieval without corrupting production indexes.
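
One way to "hash configs in metadata" is to fingerprint the full parser, chunker, and embedding configuration and stamp that digest on every chunk. The following is a minimal sketch; the config keys, parser and model names, and the chunk record layout are all hypothetical and not taken from any specific product.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Return a stable SHA-256 hex digest for a JSON-serializable config."""
    # sort_keys and fixed separators make the digest independent of key order
    # and whitespace, so equal configs always hash to the same value.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical pipeline settings; names and values are illustrative only.
pipeline_config = {
    "parser": {"name": "example-pdf-parser", "version": "1.4.0"},
    "chunker": {"strategy": "by_heading", "max_tokens": 512, "overlap_tokens": 64},
    "embedding_model": {"name": "example-embed", "revision": "2024-10-01"},
}

fingerprint = config_fingerprint(pipeline_config)

# Stamp the digest on each chunk so stale vectors can be found after a change.
chunk_record = {"chunk_id": "hr-policy-001#3", "config_sha256": fingerprint}
print(chunk_record)
```

With a digest on every record, a re-embedding job can select exactly the chunks whose `config_sha256` differs from the current one, and a shadow index can be kept separate from production by fingerprint.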

Chunking, overlap, and retrieval hygiene

Overlap between chunks reduces boundary artifacts but increases storage and duplicate hits in dense corpora. Tune overlap using offline recall@k on a labeled question set—not intuition alone. For hierarchical docs, parent-child chunk linking (summary node + leaf passages) often beats flat windows for multi-hop questions.
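
Comparing overlap settings with offline recall@k can be sketched as below. This assumes the labeled question set maps each question to passage ids that stay stable across chunking configurations; the ids, the two example runs, and the configuration names are made up for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each labeled question needs at least one relevant id")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs, labels, k):
    """Average recall@k over every labeled question."""
    scores = [recall_at_k(runs[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores)


# Illustrative labels: question id -> ids of passages a human marked relevant.
labels = {"q1": {"c1", "c2"}, "q2": {"c7"}}

# Illustrative ranked results from two hypothetical chunking configurations.
runs_no_overlap = {"q1": ["c1", "c9", "c4"], "q2": ["c3", "c5", "c7"]}
runs_overlap_64 = {"q1": ["c1", "c2", "c4"], "q2": ["c7", "c5", "c3"]}

for name, runs in [("no_overlap", runs_no_overlap), ("overlap_64", runs_overlap_64)]:
    print(name, mean_recall_at_k(runs, labels, k=2))
```

In practice the comparison is only meaningful when both runs use the same question set, the same k, and the same embedding model, so that overlap is the only variable.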

Pre-compute metadata per chunk: source URI, section title, last_modified, classification labels, and optional structural keys (product line, region). That metadata becomes filters at query time and reduces the need for the LLM to "guess" scope from raw text alone.
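
As a rough illustration of metadata-as-filters, the sketch below models a per-chunk record and applies the filters before any similarity search would run. The field names mirror the list above; the `ChunkMetadata` class, the `matches` helper, and the sample URIs are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ChunkMetadata:
    chunk_id: str
    source_uri: str
    section_title: str
    last_modified: date
    classification: str
    structural_keys: dict = field(default_factory=dict)  # e.g. product line, region


def matches(meta, *, allowed_classifications, modified_since=None, **required_keys):
    """Return True if the chunk passes access, freshness, and scope filters."""
    if meta.classification not in allowed_classifications:
        return False
    if modified_since is not None and meta.last_modified < modified_since:
        return False
    return all(meta.structural_keys.get(k) == v for k, v in required_keys.items())


chunks = [
    ChunkMetadata("c1", "https://example.invalid/policies/leave.pdf",
                  "Parental leave", date(2024, 9, 1), "internal", {"region": "EU"}),
    ChunkMetadata("c2", "https://example.invalid/policies/leave-2019.pdf",
                  "Parental leave", date(2019, 3, 5), "internal", {"region": "EU"}),
    ChunkMetadata("c3", "https://example.invalid/policies/comp.pdf",
                  "Salary bands", date(2024, 6, 1), "restricted", {"region": "EU"}),
]

# Only chunks the caller may see, that are recent, and that are in scope.
candidates = [
    c.chunk_id
    for c in chunks
    if matches(c, allowed_classifications={"internal"},
               modified_since=date(2023, 1, 1), region="EU")
]
print(candidates)
```

Most vector stores expose their own filter syntax; the point of the sketch is only that the filterable fields have to be computed and stored at ingestion time.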

[Diagram: funnel from raw ingest through deduplication and PII handling to indexed chunks, with metrics at each stage.]

Figure 2. Treat corpus preparation as a measured funnel: every stage should emit metrics you can alert on.
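
A measured funnel can be sketched as a single pass that counts what each stage drops or changes. The example below is deliberately simplified and rests on two loud assumptions: exact-match hashing of normalized text stands in for shingle, simhash, or embedding near-duplicate detection, and a single email regex stands in for a legally reviewed PII process. The stage and metric names are hypothetical.

```python
import hashlib
import re
from collections import Counter

# Stand-in only: real PII handling needs far more than one regex.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def run_funnel(docs):
    """Return (kept_docs, metrics) with one counter per funnel stage."""
    metrics = Counter(ingested=len(docs))
    seen, kept = set(), []
    for text in docs:
        if not text.strip():
            metrics["dropped_empty"] += 1
            continue
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            metrics["dropped_duplicate"] += 1
            continue
        seen.add(digest)
        redacted, count = EMAIL.subn("[REDACTED_EMAIL]", text)
        metrics["pii_redactions"] += count
        kept.append(redacted)
    metrics["indexed"] = len(kept)
    return kept, dict(metrics)


docs = [
    "Contact hr@example.com for leave.",
    "contact  HR@example.com for leave.",  # duplicate after normalization
    "   ",                                  # empty after stripping
    "Travel policy v2.",
]
kept, metrics = run_funnel(docs)
print(metrics)
```

Emitting these counters on every run makes it possible to alert when, for example, the duplicate or empty-document share jumps after an upstream export changes.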

Governance under higher uncertainty

Model outputs are non-deterministic; governance shifts toward dataset and retrieval provenance—what went in, when it was indexed, which embedding model revision produced the vector, and who authorized access. Data engineers partner with security and legal to classify corpora, enforce redaction before indexing, and prove deletion when retention windows expire (including vector tombstones and reindex jobs).

If you cannot answer "which document version was in the index when this answer was generated?" you are not ready for customer-facing agents—only for internal experiments.
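
Answering that question requires an index history that records when each document version became searchable and when it was superseded or deleted. The sketch below assumes such a history exists as a simple list; the `IndexEntry` fields and the `version_at` helper are hypothetical, and a production system would more likely keep this in a table with the same columns.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class IndexEntry:
    doc_id: str
    doc_version: str
    embedding_revision: str
    indexed_at: datetime
    removed_at: Optional[datetime] = None  # set when superseded or deleted


def version_at(history, doc_id, answered_at):
    """Return the entry for doc_id that was searchable at answered_at, if any."""
    for entry in history:
        if entry.doc_id != doc_id:
            continue
        still_live = entry.removed_at is None or answered_at < entry.removed_at
        if entry.indexed_at <= answered_at and still_live:
            return entry
    return None


history = [
    IndexEntry("leave-policy", "v1", "embed-rev-a",
               datetime(2024, 1, 10), removed_at=datetime(2024, 6, 1)),
    IndexEntry("leave-policy", "v2", "embed-rev-b", datetime(2024, 6, 1)),
]

hit = version_at(history, "leave-policy", datetime(2024, 3, 15))
print(hit.doc_version, hit.embedding_revision)
```

The same history supports proof of deletion: once a retention window expires, every entry for the document should carry a `removed_at` timestamp and no lookup after that time should return it.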

Observability: beyond row counts

Add SLOs for embedding lag (source commit → searchable), index size growth, query latency percentiles, and empty-result rates. Pair technical metrics with product metrics: thumbs-down on answers, escalation rates to human support, and human-judged correctness samples on a fixed golden set refreshed when the corpus changes.
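
Two of these SLOs, embedding lag percentiles and empty-result rate, reduce to small calculations over logged samples. The sketch below uses a nearest-rank percentile; the sample values and the 600-second lag target are invented for illustration and are not recommended thresholds.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def empty_result_rate(result_counts):
    """Share of queries that returned zero chunks."""
    return sum(1 for n in result_counts if n == 0) / len(result_counts)


# Illustrative samples: seconds from source commit to searchable, per document.
embedding_lag_s = [42, 55, 61, 70, 95, 120, 180, 240, 310, 900]
# Illustrative samples: number of chunks returned, per query.
results_per_query = [5, 3, 0, 8, 0, 4, 6, 2, 0, 7]

LAG_P95_TARGET_S = 600  # hypothetical SLO target

p50 = percentile(embedding_lag_s, 50)
p95 = percentile(embedding_lag_s, 95)
lag_slo_breached = p95 > LAG_P95_TARGET_S
empty_rate = empty_result_rate(results_per_query)

print(p50, p95, lag_slo_breached, empty_rate)
```

Percentiles matter here because a healthy median can hide a long tail: in the sample above the median lag is modest while the slowest document takes far longer to become searchable.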

Log retrieval bundles (ids + scores + filters) per request; retain according to privacy policy.
Track drift when upstream schemas or PDF templates change: alert on schema or template diffs and on spikes in parse failures.
Run periodic "index audits" sampling chunks for formatting breaks and encoding issues.
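
The retrieval bundle from the first item can be sketched as one structured log line per request. The field names, the `index_version` value, and the `retrieval_bundle` helper below are assumptions for illustration; the record holds ids, scores, and filters only, with no query or chunk text, which keeps the log smaller and easier to retain under a privacy policy.

```python
import json
from datetime import datetime, timezone


def retrieval_bundle(request_id, hits, filters, index_version):
    """Build a log record of what retrieval returned for one request."""
    return {
        "request_id": request_id,
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "index_version": index_version,
        "filters": filters,
        "hits": [{"chunk_id": cid, "score": round(score, 4)} for cid, score in hits],
    }


bundle = retrieval_bundle(
    request_id="req-0001",
    hits=[("c1", 0.83124), ("c7", 0.79055)],
    filters={"classification": ["internal"], "region": "EU"},
    index_version="2024-11-18T02:00Z",
)

# One JSON object per line is easy to ship to most log pipelines.
line = json.dumps(bundle, sort_keys=True)
print(line)
```

Joining these records with an index history by chunk id and timestamp is what lets an audit reconstruct which passages supported a given answer.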

Operating model

Small platform teams that own shared connectors, quality checks, and observability for both tables and knowledge corpora reduce duplicate glue across squads. Product teams own domain semantics and acceptance tests for answers; platform owns repeatable patterns for ingestion, embedding, and audit. Data science partners on evaluation design—not as the sole owners of every text pipeline in perpetuity.

Start with one vertical slice (e.g., internal HR policy or a single product manual), harden SLAs and lineage, then generalize templates. Horizontal "index everything" launches rarely survive first contact with compliance.

Engineering and research perspective from the CognitiveBricks team. Practices evolve quickly; validate approaches against your security, license, and compliance requirements.