Data Engineering in the Era of Generative AI
Pipelines, governance, and quality remain the bedrock—while embeddings, RAG, and agent workloads add new surfaces for the modern data platform.
What changes—and what does not
Batch and streaming ETL, dimensional modeling, and data contracts still anchor enterprise analytics. What changes is the consumer set: retrieval systems, fine-tuning datasets, and offline evaluation sets need curated text, multimodal assets, and often smaller but higher-signal slices than a warehouse-wide dump. The warehouse remains the system of record for many metrics; the knowledge corpus becomes a parallel product with its own lifecycle.
If you only bolt a vector index onto messy SharePoint exports, agents will confidently cite stale policy PDFs and duplicate paragraphs. The engineering problem is the same as always: garbage in, brittle automation out—only the blast radius is larger because user-facing answers sound authoritative.
Platform extensions for gen-AI workloads
Most mature stacks add four capabilities. None replaces core ingestion; each demands the same operational discipline as your critical pipelines.
- Document ingestion with OCR where needed, language detection, deduplication (shingle hashes, simhash, or embedding near-dupe detection), and PII handling aligned to legal review—not a regex-only afterthought.
- Chunking strategies tied to domain structure: headings for policies, slides for decks, API sections for OpenAPI, tickets for ITSM exports. Avoid arbitrary 512-token windows unless you have measured that they retrieve as well as the boundaries a human reader would draw.
- Vector stores as governed assets: row-level security mapped from source ACLs, retention tied to records management, and re-embedding jobs scheduled when embedding models or parsers change.
- Prompt and parser versioning treated like feature definitions: hash configs in metadata, pin models, and support A/B or shadow retrieval without corrupting production indexes.
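As one illustration of the deduplication step in the first bullet, the sketch below flags near-duplicate documents by comparing hashed word shingles with Jaccard similarity. The function names, the shingle width of 3, and the 0.8 threshold are assumptions chosen for the example, not values from any particular platform. The pairwise comparison is quadratic, so a real corpus would likely need MinHash or simhash bucketing on top of it.

```python
from __future__ import annotations

import hashlib
import re


def shingles(text: str, k: int = 3) -> set[str]:
    """Return hashed word k-shingles of the lowercased, punctuation-free text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        grams = [" ".join(words)] if words else []
    else:
        grams = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {hashlib.sha1(g.encode("utf-8")).hexdigest()[:16] for g in grams}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Keep the first document of each near-duplicate group (pairwise, O(n^2))."""
    kept: list[tuple[str, set[str]]] = []
    for doc_id, text in docs.items():
        signature = shingles(text)
        # Keep the document only if it is dissimilar to everything kept so far.
        if all(jaccard(signature, other) < threshold for _, other in kept):
            kept.append((doc_id, signature))
    return [doc_id for doc_id, _ in kept]
```

Called on a dictionary of document ids to extracted text, `drop_near_duplicates` returns the ids worth indexing. Documents that differ only in case or punctuation collapse to the first one seen.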
Chunking, overlap, and retrieval hygiene
Overlap between chunks reduces boundary artifacts but increases storage and duplicate hits in dense corpora. Tune overlap using offline recall@k on a labeled question set—not intuition alone. For hierarchical docs, parent-child chunk linking (summary node + leaf passages) often beats flat windows for multi-hop questions.
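A minimal sketch of that tuning loop follows: a sliding-window chunker with configurable overlap, plus a recall@k function to score each setting against a labeled question set. Here recall@k is taken to mean the average, over questions, of the share of relevant chunks that appear in the top k results. The helper names and input shapes are assumptions made for illustration.

```python
from __future__ import annotations


def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` where neighbors share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def recall_at_k(
    retrieved: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int,
) -> float:
    """Mean over questions of |top-k retrieved ∩ relevant| / |relevant|."""
    scores = []
    for question_id, gold in relevant.items():
        if not gold:
            continue  # unlabeled questions cannot be scored
        top_k = set(retrieved.get(question_id, [])[:k])
        scores.append(len(top_k & gold) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0
```

In practice you would rebuild the index for each candidate overlap, run the labeled questions through retrieval, and compare `recall_at_k` alongside index size, since higher overlap buys recall at a storage cost.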
Pre-compute metadata per chunk: source URI, section title, last_modified, classification labels, and optional structural keys (product line, region). That metadata becomes filters at query time and reduces the need for the LLM to "guess" scope from raw text alone.
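The metadata fields listed above might be carried in a record like the one below and applied as a pre-filter before any similarity search. The `ChunkMeta` class and its field names are hypothetical, and a real vector store would normally express these filters in its own query syntax instead of in Python.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ChunkMeta:
    chunk_id: str
    source_uri: str
    section_title: str
    last_modified: datetime
    classification: str
    # Optional structural keys such as product line or region.
    keys: dict[str, str] = field(default_factory=dict)


def matches(
    meta: ChunkMeta,
    *,
    allowed_classes: set[str],
    modified_after: datetime | None = None,
    **keys: str,
) -> bool:
    """True if the chunk passes the classification, freshness, and key filters."""
    if meta.classification not in allowed_classes:
        return False
    if modified_after is not None and meta.last_modified < modified_after:
        return False
    return all(meta.keys.get(name) == value for name, value in keys.items())


def filter_chunks(chunks: list[ChunkMeta], **criteria) -> list[ChunkMeta]:
    """Apply `matches` to every chunk, preserving input order."""
    return [chunk for chunk in chunks if matches(chunk, **criteria)]
```

A query scoped to one region, for example, would call `filter_chunks(chunks, allowed_classes={"internal"}, region="EU")` and pass only the surviving chunks to ranking.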
Governance under higher uncertainty
Model outputs are non-deterministic; governance shifts toward dataset and retrieval provenance—what went in, when it was indexed, which embedding model revision produced the vector, and who authorized access. Data engineers partner with security and legal to classify corpora, enforce redaction before indexing, and prove deletion when retention windows expire (including vector tombstones and reindex jobs).
If you cannot answer "which document version was in the index when this answer was generated?" you are not ready for customer-facing agents—only for internal experiments.
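One way to make that question answerable is an append-only log of index events, each stamped with the document version and a hash of the parser and embedding configuration. The sketch below keeps the log in memory, and every name in it (`IndexEvent`, `ProvenanceLog`, `config_fingerprint`) is hypothetical. A production system would more plausibly store these rows in a warehouse table.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass
from datetime import datetime


def config_fingerprint(config: dict) -> str:
    """Stable short hash of parser, chunker, and embedding settings."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class IndexEvent:
    doc_id: str
    doc_version: str | None  # None marks a tombstone (the document was deleted)
    config_hash: str
    indexed_at: datetime


class ProvenanceLog:
    """Append-only record of what entered or left the index, and when."""

    def __init__(self) -> None:
        self._events: list[IndexEvent] = []

    def record(self, event: IndexEvent) -> None:
        self._events.append(event)

    def version_at(self, doc_id: str, when: datetime) -> IndexEvent | None:
        """Return the latest event for `doc_id` at or before `when`, if any."""
        candidates = [
            e for e in self._events
            if e.doc_id == doc_id and e.indexed_at <= when
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda e: e.indexed_at)
```

Given the timestamp of a logged answer, `version_at` returns the document version and configuration hash that were live at that moment. A returned event whose `doc_version` is `None` shows the document had already been deleted.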
Observability: beyond row counts
Add SLOs for embedding lag (source commit → searchable), index size growth, query latency percentiles, and empty-result rates. Pair technical metrics with product metrics: thumbs-down on answers, escalation rates to human support, and human-judged correctness samples on a fixed golden set refreshed when the corpus changes.
- Log retrieval bundles (ids + scores + filters) per request; retain according to privacy policy.
- Track drift when upstream schemas or PDF templates change, and alert on spikes in parse failures.
- Run periodic "index audits" sampling chunks for formatting breaks and encoding issues.
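The per-request retrieval bundle and two of the metrics above (empty-result rate and latency percentiles) can be sketched as follows. The `RetrievalBundle` shape is an assumption, and the percentile uses the simple nearest-rank method. A metrics backend would normally compute these from the emitted log lines instead.

```python
from __future__ import annotations

import json
import math
from dataclasses import asdict, dataclass


@dataclass
class RetrievalBundle:
    request_id: str
    chunk_ids: list[str]
    scores: list[float]
    filters: dict[str, str]
    latency_ms: float

    def to_log_line(self) -> str:
        """Serialize the bundle as one JSON line for the request log."""
        return json.dumps(asdict(self), sort_keys=True)


def empty_result_rate(bundles: list[RetrievalBundle]) -> float:
    """Share of requests whose retrieval returned no chunks."""
    if not bundles:
        return 0.0
    return sum(1 for b in bundles if not b.chunk_ids) / len(bundles)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Feeding a day's bundles into `empty_result_rate` and `percentile([b.latency_ms for b in bundles], 95)` gives two numbers that can be compared directly against the SLO targets. How long the raw log lines are retained remains subject to the privacy policy.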
Operating model
Small platform teams that own shared connectors, quality checks, and observability for both tables and knowledge corpora reduce duplicate glue across squads. Product teams own domain semantics and acceptance tests for answers; the platform team owns repeatable patterns for ingestion, embedding, and audit. Data science partners on evaluation design rather than owning every text pipeline in perpetuity.
Start with one vertical slice (e.g., internal HR policy or a single product manual), harden SLAs and lineage, then generalize templates. Horizontal "index everything" launches rarely survive first contact with compliance.
Engineering and research perspective from the CognitiveBricks team. Practices evolve quickly; validate approaches against your security, license, and compliance requirements.