Open Knowledge Format (OKF) Bundle Explorer

Dear reader, have you ever felt that modern data governance is a bit too heavy-handed? We build rigid schemas, complex API registries, and massive data catalogs, only to find that our AI agents and human teammates struggle to get a cohesive, high-level picture of how everything fits together.

This is where the Open Knowledge Format (OKF) comes in. OKF is a lightweight, human- and agent-friendly convention for representing knowledge, including the metadata, context, and curated insights surrounding data and systems. Rather than defining a rigid database schema, OKF organizes concepts as simple Markdown files with YAML frontmatter inside a git-compatible folder structure. This makes a bundle easy to track in source control, edit by hand, and consume with LLM-based agents.

In this chapter, we will build a TypeScript implementation of an OKF “consumption agent”. We’ll parse a local OKF bundle, index its documents in memory, and query a local LLM via Ollama to answer questions about retail analytics datasets and procedures.

The examples for this chapter are in the directory source-code/open-knowledge-format.

Inspiration and the Specification

Before we write code, let’s look at where these ideas come from. Google has proposed the Open Knowledge Format as a way to improve data sharing across teams and organizations. The specification aims to bridge the gap between technical data resources (like tables in a database) and the business contexts (like key metrics, definitions, and operational playbooks) that make those resources valuable.

Specifically, we want to capture:

Tables: Physical data assets (e.g., in a database or data warehouse).
Metrics: Formulations and definitions of business KPIs.
Playbooks: Manuals or guides that tell analysts what to do when something goes wrong.

The specification suggests organizing these files in a simple directory hierarchy where each Markdown document has standard YAML frontmatter holding structured attributes (type, tags, title, description, etc.) and a Markdown body for human-friendly descriptions.
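
To make those frontmatter attributes concrete, here is a minimal TypeScript sketch of the fields named above, together with a check for missing ones. The interface name OkfFrontmatter, the helper missingFields, and the choice of which fields count as required are assumptions made for illustration; the specification itself decides what is mandatory.

```typescript
// Hypothetical shape for OKF frontmatter. The field names follow the
// attributes mentioned in the text (type, title, description, tags).
interface OkfFrontmatter {
  type?: string;
  title?: string;
  description?: string;
  tags?: string[];
  // Bundles may carry extra attributes (owner, resource, and so on).
  [key: string]: string | string[] | undefined;
}

// Assumption: we treat these three fields as required for our own bundle.
const REQUIRED_FIELDS = ["type", "title", "description"] as const;

// Return the names of required fields that are absent or blank.
function missingFields(fm: OkfFrontmatter): string[] {
  return REQUIRED_FIELDS.filter((field) => {
    const value = fm[field];
    return typeof value !== "string" || value.trim() === "";
  });
}

const example: OkfFrontmatter = {
  type: "Database Table",
  title: "Sales Events",
  tags: ["sales", "raw"],
};

console.log(missingFields(example)); // [ 'description' ]
```

A check like this could run in a pre-commit hook so that incomplete concept files are caught before they reach the shared bundle.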

This implementation is based on the concepts and drafts defined by Google:

Concept Blog Post: How the Open Knowledge Format can improve data sharing
Format Specification: Google Cloud Platform Knowledge Catalog - OKF Specification

Note: This is a clean-room implementation of the OKF conceptual specification and does not use Google’s proprietary libraries.

The Knowledge Bundle Structure

In our project, we have a sample knowledge bundle representing a retail analytics platform under the bundle/ directory:

bundle/
├── index.md                      # Bundle root index / table of contents
├── tables/
│   ├── sales_events.md          # Raw transaction data catalog
│   ├── products.md              # Canonical product dimension
│   └── customers.md             # Privacy-safe customer dimension
├── metrics/
│   ├── daily_revenue.md         # Revenue metric formula
│   └── customer_ltv.md          # LTV predictive model definition
└── playbooks/
    └── revenue_drop_investigation.md # Operational guide for incident management

Let’s look at one of these concept files, dear reader, to see how clean and legible it is. Here is bundle/tables/sales_events.md (a partial listing for brevity; the complete files are in the bundle/ directory):

---
type: Database Table
title: Sales Events
description: Raw point-of-sale event stream capturing every transaction at the register level.
resource: postgres://retail-db/analytics/sales_events
tags: [sales, transactions, raw, events]
timestamp: 2026-06-01T09:00:00Z
owner: data-engineering@example.com
row_count_estimate: 2_500_000_000
update_frequency: real-time (CDC)
---

# Sales Events

The `sales_events` table is the **source of truth** for all transactional data
in the retail analytics platform. Every scan at a point-of-sale terminal writes
a record here within seconds via Change Data Capture.

...

Notice how easy this is to read! A human can view it in a terminal or edit it in VS Code, git tracks every change, and, as we’ll see next, it is perfectly set up for programmatic parsing.

Defining the OKF Data Model

Let’s start by modeling a single document in TypeScript. We define a Concept class to represent each file, holding its identifier, parsed frontmatter, and Markdown body.

Here is the implementation in okf_explorer.ts:

// okf_explorer.ts, The Concept model and metadata mapping

import fs from "node:fs";
import path from "node:path";
import { fileURLToPath, pathToFileURL } from "node:url";
import ollama from "ollama";

const RESERVED_FILENAMES = new Set(["index.md", "log.md"]);
const MODEL = "gemma4:e2b-it-qat";
const BUNDLE_DIR = path.resolve(
  path.dirname(fileURLToPath(import.meta.url)),
  "bundle"
);

type Frontmatter = Record<string, string | string[]>;

export class Concept {
  /** Relative path without the .md suffix. */
  conceptId: string;
  /** Absolute path to the file. */
  filePath: string;
  /** Parsed YAML frontmatter. */
  frontmatter: Frontmatter;
  /** Markdown body (everything after the frontmatter). */
  body: string;

  constructor(
    conceptId: string,
    filePath: string,
    frontmatter: Frontmatter,
    body: string
  ) {
    this.conceptId = conceptId;
    this.filePath = filePath;
    this.frontmatter = frontmatter;
    this.body = body;
  }

  // Convenience accessors from frontmatter
  get type(): string {
    const value = this.frontmatter.type;
    return typeof value === "string" ? value : "Unknown";
  }

  get title(): string {
    const value = this.frontmatter.title;
    return typeof value === "string" ? value : this.conceptId;
  }

  get description(): string {
    const value = this.frontmatter.description;
    return typeof value === "string" ? value : "";
  }

  get tags(): string[] {
    const value = this.frontmatter.tags;
    return Array.isArray(value) ? value : [];
  }

  /** Render the concept as a Markdown block suitable for an LLM prompt. */
  asContextBlock(): string {
    const lines = [
      `## Concept: ${this.title}`,
      `**ID**: ${this.conceptId}`,
      `**Type**: ${this.type}`,
      `**Description**: ${this.description}`,
    ];
    if (this.tags.length > 0) {
      lines.push(`**Tags**: ${this.tags.join(", ")}`);
    }
    lines.push("");
    lines.push(this.body.trim());
    return lines.join("\n");
  }
}
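
To see what the LLM will eventually receive, here is a stand-alone sketch that repeats the rendering logic of asContextBlock() on the sales_events metadata shown earlier. The ConceptLike interface and the renderContextBlock function are illustrative names rather than part of okf_explorer.ts, and the body text is shortened.

```typescript
// Stand-alone restatement of the rendering logic, for illustration only.
interface ConceptLike {
  conceptId: string;
  title: string;
  type: string;
  description: string;
  tags: string[];
  body: string;
}

function renderContextBlock(c: ConceptLike): string {
  const lines = [
    `## Concept: ${c.title}`,
    `**ID**: ${c.conceptId}`,
    `**Type**: ${c.type}`,
    `**Description**: ${c.description}`,
  ];
  // The tags line is omitted entirely when a concept has no tags.
  if (c.tags.length > 0) {
    lines.push(`**Tags**: ${c.tags.join(", ")}`);
  }
  lines.push("", c.body.trim());
  return lines.join("\n");
}

console.log(
  renderContextBlock({
    conceptId: "tables/sales_events",
    title: "Sales Events",
    type: "Database Table",
    description: "Raw point-of-sale event stream capturing every transaction at the register level.",
    tags: ["sales", "transactions", "raw", "events"],
    body: "# Sales Events\n\n(body shortened for this example)\n",
  })
);
// ## Concept: Sales Events
// **ID**: tables/sales_events
// **Type**: Database Table
// **Description**: Raw point-of-sale event stream capturing every transaction at the register level.
// **Tags**: sales, transactions, raw, events
//
// # Sales Events
//
// (body shortened for this example)
```

Putting the concept ID on its own line is what lets the model cite it later.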

To keep our implementation lightweight and self-contained, dear reader, I have written a simple YAML frontmatter parser. It doesn’t support the full YAML specification: it handles only single-line scalars and inline arrays (like [foo, bar]), with no nesting, block lists, or commas inside quoted items. That subset covers the metadata in our bundle.

// okf_explorer.ts, Minimal YAML parser and file system walker for OKF concept parsing

/** Remove one leading and one trailing quote character, if present. */
function stripQuotes(value: string): string {
  return value.replace(/^["']|["']$/g, "");
}

/**
 * Minimal YAML parser for OKF frontmatter.
 *
 * Handles the subset used in this bundle:
 *   - key: scalar value
 *   - key: [item1, item2, ...]   (inline lists)
 * Does NOT require a YAML library so the example stays lightweight.
 * Upgrade to `import yaml from "yaml"; yaml.parse(...)` for production use.
 */
function parseSimpleYaml(yamlText: string): Frontmatter {
  const result: Frontmatter = {};
  for (const rawLine of yamlText.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Skip blank lines and comment lines.
    if (!line || line.startsWith("#")) {
      continue;
    }
    // Split on the first colon only, so values such as URLs and
    // timestamps keep their own colons.
    const colonIndex = line.indexOf(":");
    if (colonIndex === -1) {
      continue;
    }
    const key = line.slice(0, colonIndex).trim();
    const rest = line.slice(colonIndex + 1).trim();

    if (rest.startsWith("[") && rest.endsWith("]")) {
      const items = rest
        .slice(1, -1)
        .split(",")
        .map((item) => stripQuotes(item.trim()))
        .filter((item) => item.length > 0);
      result[key] = items;
    } else {
      result[key] = stripQuotes(rest);
    }
  }
  return result;
}

/**
 * Parse a single OKF concept document.
 *
 * Returns null for reserved filenames (index.md, log.md).
 */
function parseConcept(filePath: string, bundleRoot: string): Concept | null {
  const fileName = path.basename(filePath);
  if (RESERVED_FILENAMES.has(fileName)) {
    return null;
  }

  const text = fs.readFileSync(filePath, "utf-8");

  // Extract YAML frontmatter delimited by ---
  let frontmatter: Frontmatter = {};
  let body = text;
  const fmMatch = text.match(/^---\s*\n(.*?)\n---\s*\n/s);
  if (fmMatch && fmMatch[1] !== undefined) {
    frontmatter = parseSimpleYaml(fmMatch[1]);
    body = text.slice(fmMatch[0].length);
  }

  // Concept ID = path relative to bundle root, without .md suffix
  let conceptId = path.relative(bundleRoot, filePath).replace(/\.md$/i, "");
  // Normalise Windows separators
  conceptId = conceptId.replace(/\\/g, "/");

  return new Concept(conceptId, filePath, frontmatter, body);
}

/** Recursively collect the paths of all .md files under `dir`. */
function walkMarkdownFiles(dir: string): string[] {
  const files: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      files.push(...walkMarkdownFiles(fullPath));
    } else if (entry.isFile() && entry.name.endsWith(".md")) {
      files.push(fullPath);
    }
  }
  return files;
}
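
One limitation of the inline-list branch in parseSimpleYaml is worth seeing in action: it splits on every comma, so a quoted item that itself contains a comma is broken apart. The sketch below shows the problem and one possible quote-aware alternative. The helper name splitInlineList is hypothetical and not part of okf_explorer.ts, and it still covers far less than a real YAML library would.

```typescript
// Hypothetical helper: split a YAML inline list on commas, but ignore
// commas that appear inside single or double quotes.
function splitInlineList(text: string): string[] {
  const inner = text.trim().replace(/^\[/, "").replace(/\]$/, "");
  const items: string[] = [];
  let current = "";
  let quote: string | null = null; // the quote character we are inside, if any

  for (const ch of inner) {
    if (quote !== null) {
      if (ch === quote) {
        quote = null; // closing quote: drop it
      } else {
        current += ch; // keep everything inside quotes, commas included
      }
    } else if (ch === '"' || ch === "'") {
      quote = ch; // opening quote: drop it and remember which kind
    } else if (ch === ",") {
      items.push(current.trim());
      current = "";
    } else {
      current += ch;
    }
  }
  items.push(current.trim());
  return items.filter((item) => item.length > 0);
}

// Naive comma split, as in the minimal parser, for comparison.
const naive = '"Smith, J.", sales'.split(",").map((s) => s.trim());
console.log(naive.length); // 3

console.log(splitInlineList('["Smith, J.", sales, raw]'));
// [ 'Smith, J.', 'sales', 'raw' ]
```

This sketch treats any quote character as a delimiter, so an unquoted item containing an apostrophe would still be mishandled. For bundles you do not control, a full YAML library remains the safer choice.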

Implementing the Knowledge Bundle

Next, we implement the KnowledgeBundle class, which loads all files recursively and provides basic filtering and keyword search. Since the whole bundle is held in memory, a linear scan that counts keyword occurrences is fast enough for a bundle of this size.

// okf_explorer.ts, In-memory KnowledgeBundle representation and keyword search indexing

/** An in-memory representation of an OKF knowledge bundle. */
export class KnowledgeBundle {
  root: string;
  concepts: Concept[];

  constructor(root: string, concepts: Concept[]) {
    this.root = root;
    this.concepts = concepts;
  }

  /** Recursively walk `root` and parse every concept document. */
  static load(root: string): KnowledgeBundle {
    const concepts: Concept[] = [];
    for (const mdPath of walkMarkdownFiles(root)) {
      const concept = parseConcept(mdPath, root);
      if (concept !== null) {
        concepts.push(concept);
      }
    }
    concepts.sort((a, b) => a.filePath.localeCompare(b.filePath));
    return new KnowledgeBundle(root, concepts);
  }

  // ------------------------------------------------------------------
  // Search / index helpers
  // ------------------------------------------------------------------

  /** Return all concepts of a given type (case-insensitive). */
  byType(conceptType: string): Concept[] {
    const target = conceptType.toLowerCase();
    return this.concepts.filter((c) => c.type.toLowerCase() === target);
  }

  /** Return all concepts that carry a given tag (case-insensitive). */
  byTag(tag: string): Concept[] {
    const target = tag.toLowerCase();
    return this.concepts.filter((c) =>
      c.tags.some((t) => t.toLowerCase() === target)
    );
  }

  /**
   * Simple keyword search across title, description, and body text.
   * Returns concepts sorted by hit count (descending).
   */
  search(query: string): Concept[] {
    // Lowercase the query and drop words of one or two characters.
    const keywords = query
      .split(/\s+/)
      .map((w) => w.toLowerCase())
      .filter((w) => w.length > 2);

    const scored: Array<{ score: number; concept: Concept }> = [];
    for (const concept of this.concepts) {
      const haystack = (
        concept.title +
        " " +
        concept.description +
        " " +
        concept.body
      ).toLowerCase();
      // Splitting on a keyword yields (occurrences + 1) pieces.
      const score = keywords.reduce(
        (sum, kw) => sum + (haystack.split(kw).length - 1),
        0
      );
      if (score > 0) {
        scored.push({ score, concept });
      }
    }
    scored.sort((a, b) => b.score - a.score);
    return scored.map((s) => s.concept);
  }

  /** One-line summary of the bundle contents. */
  summary(): string {
    const typeCounts = new Map<string, number>();
    for (const c of this.concepts) {
      typeCounts.set(c.type, (typeCounts.get(c.type) ?? 0) + 1);
    }
    const counts = Array.from(typeCounts.entries())
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([type, count]) => `${count}× ${type}`)
      .join(", ");
    return `${this.concepts.length} concepts (${counts})`;
  }
}
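
The scoring inside search() counts substring occurrences, so the keyword "revenue" also scores a hit inside "revenues". The small sketch below isolates that counting step and contrasts it with a whole-word variant. Both function names are made up for this illustration, and the sample sentence is toy text rather than content from the bundle.

```typescript
// Count occurrences the same way search() does: split on the keyword.
function countSubstring(haystack: string, keyword: string): number {
  return haystack.split(keyword).length - 1;
}

// Hypothetical stricter variant: count only whole-word matches.
function countWholeWord(haystack: string, keyword: string): number {
  // Escape regex metacharacters so the keyword is matched literally.
  const escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const matches = haystack.match(new RegExp(`\\b${escaped}\\b`, "g"));
  return matches === null ? 0 : matches.length;
}

const text = "daily revenue by store; revenue excludes returns; revenues restated";
console.log(countSubstring(text, "revenue")); // 3
console.log(countWholeWord(text, "revenue")); // 2
```

Substring matching is the more forgiving choice here: identifiers such as sales_events contain underscores, which count as word characters, so a whole-word search for "sales" would miss them.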

* * *

Building the Consumption Agent

The OKFAgent acts as our consumption agent. When asked a question, it queries the KnowledgeBundle using our simple keyword search, formats the top results as structured Markdown context blocks, and passes them to a local LLM via Ollama using a system prompt that enforces citing the source concept IDs.

// okf_explorer.ts, OKFAgent using Ollama for context-augmented Q&A

/**
 * A simple 'consumption agent' that uses an Ollama LLM to answer
 * questions about the knowledge bundle.
 *
 * This follows the OKF spec's vision of agents that can read and
 * traverse the bundle to surface curated insight.
 */
export class OKFAgent {
  // Strip leading spaces and tabs only, so the blank line between the
  // two paragraphs of the prompt is preserved.
  static readonly SYSTEM_PROMPT =
    `You are a data knowledge assistant. You have been given excerpts from
an Open Knowledge Format (OKF) knowledge bundle, a collection of
structured documentation about data tables, metrics, and operational
playbooks for a retail analytics platform.

Answer the user's question using ONLY the provided knowledge context.
Be concise, accurate, and cite the concept ID (e.g. tables/sales_events)
when referring to a specific asset. If the answer is not in the context,
say so clearly rather than guessing.`.replace(/^[ \t]+/gm, "");

  bundle: KnowledgeBundle;
  model: string;

  constructor(bundle: KnowledgeBundle, model: string = MODEL) {
    this.bundle = bundle;
    this.model = model;
  }

  /** Select the most relevant concepts and format them as context. */
  buildContext(query: string, topK: number = 4): string {
    let relevant = this.bundle.search(query).slice(0, topK);
    if (relevant.length === 0) {
      relevant = this.bundle.concepts.slice(0, topK); // fallback: first N
    }
    const blocks = relevant.map((c) => c.asContextBlock());
    return blocks.join("\n\n---\n\n");
  }

  /** Send a question to the LLM with relevant OKF context. */
  async ask(question: string): Promise<string> {
    const context = this.buildContext(question);
    const userMessage = `## Knowledge Context\n\n${context}\n\n---\n\n## Question\n\n${question}`;

    const response = await ollama.chat({
      model: this.model,
      messages: [
        { role: "system", content: OKFAgent.SYSTEM_PROMPT },
        { role: "user", content: userMessage },
      ],
    });

    return response.message.content ?? "";
  }
}
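
The flow in ask() is easier to test if prompt assembly is separated from the model call. The following sketch shows one way to do that, under the assumption that the model is hidden behind a small function type. ChatMessage, ChatFn, buildMessages, answer, and stubChat are all names invented for this example; only the message layout is taken from the listing above.

```typescript
// Minimal message shape, mirroring the role/content pairs used above.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Hypothetical abstraction: anything that turns messages into a reply.
type ChatFn = (messages: ChatMessage[]) => Promise<string>;

// Assemble the same two-message prompt that ask() sends.
function buildMessages(
  systemPrompt: string,
  context: string,
  question: string
): ChatMessage[] {
  const userMessage = `## Knowledge Context\n\n${context}\n\n---\n\n## Question\n\n${question}`;
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: userMessage },
  ];
}

// Answer a question with whichever chat function is supplied.
async function answer(
  chat: ChatFn,
  systemPrompt: string,
  context: string,
  question: string
): Promise<string> {
  return chat(buildMessages(systemPrompt, context, question));
}

// A stub model that only reports what it was given, so no LLM is needed.
const stubChat: ChatFn = async (messages) =>
  `received ${messages.length} messages; last role = ${messages[messages.length - 1]?.role}`;

answer(
  stubChat,
  "You are a data knowledge assistant.",
  "## Concept: Sales Events",
  "What is sales_events?"
).then((reply) => console.log(reply));
// received 2 messages; last role = user
```

In the real agent, the ChatFn would simply wrap the ollama.chat call shown in the listing and return response.message.content. With the stub in place, prompt formatting can be unit-tested without a running Ollama server.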

Tying It All Together

Now, let’s tie everything together in our main entry point. The script will load the bundle, display a summary and index of the files, demonstrate our filter and search methods, and finally send a few queries to our OKFAgent using Ollama.

// okf_explorer.ts, Driver program running the OKF bundle exploration and LLM query loop

/** Print a banner-style section heading. */
function printSection(title: string): void {
  const width = 70;
  console.log("\n" + "=".repeat(width));
  console.log(`  ${title}`);
  console.log("=".repeat(width));
}

async function main(): Promise<void> {
  // 1. Load the OKF bundle ------------------------------------------------
  printSection("Loading OKF Knowledge Bundle");
  const bundle = KnowledgeBundle.load(BUNDLE_DIR);
  console.log(`Bundle root : ${BUNDLE_DIR}`);
  console.log(`Contents    : ${bundle.summary()}`);

  // 2. Show all concept IDs and types ------------------------------------
  printSection("All Concepts in Bundle");
  for (const c of bundle.concepts) {
    console.log(`  [${c.type.padEnd(18)}]  ${c.conceptId}`);
    if (c.description) {
      console.log(`  ${" ".repeat(20)}  ${c.description.slice(0, 72)}`);
    }
  }

  // 3. Demonstrate index / search ----------------------------------------
  printSection("Search: 'revenue'");
  for (const c of bundle.search("revenue").slice(0, 3)) {
    console.log(`  ${c.conceptId}  -  ${c.description.slice(0, 60)}`);
  }

  printSection("Filter by type: 'Metric'");
  for (const c of bundle.byType("Metric")) {
    console.log(`  ${c.conceptId}  -  ${c.title}`);
  }

  printSection("Filter by tag: 'KPI'");
  for (const c of bundle.byTag("KPI")) {
    console.log(`  ${c.conceptId}  -  ${c.title}`);
  }

  // 4. LLM Q&A over the knowledge bundle ---------------------------------
  printSection(`LLM Q&A  (model: ${MODEL})`);

  const agent = new OKFAgent(bundle);

  const questions = [
    "How is daily revenue calculated and what tables does it use?",
    "What should I do if daily revenue drops suddenly?",
    "What percentage of sales events have a customer ID? " +
      "And what does that tell us about LTV calculations?",
    "How do I join sales_events to products correctly for historical reports?",
  ];

  for (const [i, q] of questions.entries()) {
    console.log(`\nQ${i + 1}: ${q}`);
    console.log("-".repeat(60));
    const answer = await agent.ask(q);
    console.log(answer);
  }
}

// Run main() only when this file is executed directly, not when imported.
if (import.meta.url === pathToFileURL(process.argv[1]!).href) {
  if (!fs.existsSync(BUNDLE_DIR)) {
    // BUNDLE_DIR is resolved relative to this file, not the working directory.
    process.stderr.write(
      `ERROR: Bundle directory not found: ${BUNDLE_DIR}\n` +
        "Expected a bundle/ directory next to okf_explorer.ts.\n"
    );
    process.exit(1);
  }

  await main();
}

Running the Explorer and Sample Output

Let’s see what the output looks like. If you run npm start with Ollama active, you will see a detailed trace. Here is an example of what our agent prints for some of the questions:

Q1: How is daily revenue calculated and what tables does it use?

Daily revenue is calculated by summing the product of quantity, unit_price_usd, and (1 - discount_pct / 100) from the sales_events table, excluding returns (where quantity <= 0). It is grouped by the event date and store_id.
Source concept: metrics/daily_revenue, which depends on the tables/sales_events table.

Q2: What should I do if daily revenue drops suddenly?

If daily revenue drops suddenly (decline > 10% day-over-day or > 2 standard deviations below the 30-day rolling average), you should follow the playbooks/revenue_drop_investigation playbook:
1. Confirm the drop is real (check pipeline run logs and verify row counts in tables/sales_events).
2. Isolate the drop by dimension (Store, Category, Payment method, or Hour of day).
3. Check for external signals (POS outages, weather, calendar events, promotions).
4. Escalate to the appropriate team if needed (e.g., #data-engineering-oncall).

Q3: What percentage of sales events have a customer ID? And what does that tell us about LTV calculations?

According to tables/sales_events, customer_id is NULL for roughly 38% of transactions (anonymous cash sales), meaning only about 62% of sales events have a customer ID.
According to metrics/customer_ltv, this anonymous transaction rate represents a key limitation, as LTV calculations are only calculated for registered customers and may underestimate total value or bias results toward credit/debit card users.

Isn’t that neat, dear reader? Because the agent gets the raw text from the playbook and table schemas, it is able to synthesize an answer that connects the data-engineering details (e.g., NULL rates in sales_events) with the operational playbook guidance and metrics.

Summary and Future Improvements

OKF is useful because it fits into existing developer workflows. Instead of checking a web portal or wiki, you update Markdown files in git. This means your documentation, metrics definitions, and playbooks are versioned right alongside your code. Furthermore, as we have shown, it is straightforward to load these documents as context into a local LLM, turning static documentation into an interactive, context-aware AI data analyst.

If you want to take this example further in your own projects, dear reader, I recommend swapping our simple keyword-based search() method for a semantic vector search using embeddings (as we explored in the earlier local model chapters). This would let your agent locate concepts that share meaning even if they don’t share exact matching words!
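
As a starting point for that swap, here is a minimal sketch of the ranking half of semantic search: cosine similarity over vectors that are already computed. The three-dimensional vectors are toy values standing in for real embeddings, and EmbeddedConcept and rankBySimilarity are hypothetical names. Producing the vectors with an embedding model is left out.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vector length mismatch");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i]! * b[i]!;
    normA += a[i]! * a[i]!;
    normB += b[i]! * b[i]!;
  }
  // A zero vector has no direction; treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical index entry: a concept ID paired with its embedding.
interface EmbeddedConcept {
  conceptId: string;
  vector: number[];
}

// Return the IDs of the topK concepts most similar to the query vector.
function rankBySimilarity(
  query: number[],
  concepts: EmbeddedConcept[],
  topK: number = 4
): string[] {
  return concepts
    .map((c) => ({ id: c.conceptId, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((s) => s.id);
}

// Toy vectors only; real embeddings have hundreds of dimensions.
const toyIndex: EmbeddedConcept[] = [
  { conceptId: "metrics/daily_revenue", vector: [0.9, 0.1, 0.0] },
  { conceptId: "tables/customers", vector: [0.0, 0.2, 0.9] },
  { conceptId: "playbooks/revenue_drop_investigation", vector: [0.7, 0.6, 0.1] },
];

console.log(rankBySimilarity([1, 0.2, 0], toyIndex, 2).join(", "));
// metrics/daily_revenue, playbooks/revenue_drop_investigation
```

Because rankBySimilarity returns concept IDs in ranked order, it could slot in where search() is called inside buildContext() with little other change.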

Here are two suggested projects for you, dear reader, to further experiment with OKF:

Project 1: Automatic OKF Generators. Write a Python pipeline script that inspects a SQLite, PostgreSQL or BigQuery schema, extracts the column names and comments, and automatically generates or updates the frontmatter and schema tables in bundle/tables/<table_name>.md.
Project 2: Vector Search for OKF Bundles. Replace the simple substring-matching index in KnowledgeBundle.search() with a vector database. Write a script to generate text embeddings for each concept using an Ollama embedding model (like nomic-embed-text) and perform semantic retrieval instead of keyword search.
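
For Project 1, the rendering step is the same whatever language inspects the database. The sketch below is written in TypeScript to match the rest of the chapter and turns a table description into an OKF concept document. TableInfo, ColumnInfo, and renderTableConcept are hypothetical names, the column types are invented for the example, and reading the schema from a real database is not shown.

```typescript
// Hypothetical description of one column, as a schema inspector might return it.
interface ColumnInfo {
  name: string;
  type: string;
  comment?: string;
}

// Hypothetical description of one table.
interface TableInfo {
  name: string;
  description: string;
  tags: string[];
  columns: ColumnInfo[];
}

// Render frontmatter plus a Markdown schema table for one database table.
function renderTableConcept(table: TableInfo): string {
  const frontmatter = [
    "---",
    "type: Database Table",
    `title: ${table.name}`,
    `description: ${table.description}`,
    `tags: [${table.tags.join(", ")}]`,
    "---",
  ];
  const rows = table.columns.map(
    (c) => `| ${c.name} | ${c.type} | ${c.comment ?? ""} |`
  );
  const body = [
    "",
    `# ${table.name}`,
    "",
    "| Column | Type | Comment |",
    "| --- | --- | --- |",
    ...rows,
    "",
  ];
  return [...frontmatter, ...body].join("\n");
}

// Illustrative input; the column types here are assumptions.
const generated = renderTableConcept({
  name: "sales_events",
  description: "Raw point-of-sale event stream.",
  tags: ["sales", "raw"],
  columns: [
    { name: "store_id", type: "integer", comment: "Store where the sale occurred" },
    { name: "customer_id", type: "text" },
  ],
});

console.log(generated);
```

The output uses only single-line scalars and an inline tag list, so it stays within the subset that parseSimpleYaml can read back.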