Open Knowledge Format (OKF) for Human-Agent Systems

As artificial intelligence systems and autonomous agents play an increasingly central role in data analytics, a new challenge has emerged: how do we share context, curated metadata, and operational playbooks between humans and AI systems? Historically, metadata has been stored in specialized database catalog systems (like Apache Atlas, Google Cloud Dataplex, or proprietary enterprise wikis) that are difficult for agents to access without custom integrations, or formatted as raw JSON/YAML payloads that are dry and hard for humans to collaborate on.

Dear reader, in this chapter we explore the Open Knowledge Format (OKF), a minimal, lightweight, human- and agent-friendly convention for representing knowledge surrounding data systems. Originally proposed in draft specs by Google Cloud Platform, OKF takes the position that knowledge should be readable, writable, diffable (changes show up cleanly in tools like git diff), and portable using nothing more than markdown files, YAML frontmatter, and standard version control (such as Git).

In the first part of this chapter, we build a simple Python implementation of a system that uses OKF. In the final Wrap Up section, we catalog the ways you can use this example code to build knowledge bases that serve both humans and AI systems, along with practical project ideas for you to build upon.

The examples for this chapter are located in the directory source-code/openknowledge_format.

References & Inspiration

This implementation is based on the concepts and drafts defined by Google:

Concept Blog Post: How the Open Knowledge Format can improve data sharing
Format Specification: Google Cloud Platform Knowledge Catalog - OKF Specification

Note: This is a clean-room implementation of the OKF conceptual specification and does not use Google’s proprietary libraries.

What is Open Knowledge Format (OKF)?

According to the open specification draft, the format is intentionally minimal:

Human-Readable: It consists of standard Markdown documents that can be read directly in a terminal, text editor, or rendered in a browser.
Agent-Parseable: The top of each document contains a YAML frontmatter block that standardizes metadata attributes (such as the asset type, canonical URI, and classification tags).
Diffable: Because it consists of plain-text markdown, changes can be reviewed, tracked, and merged using standard version control like Git.
Portable: There is no schema registry, database, or SDK dependency. If you can git clone or zip a directory, you can share an OKF knowledge bundle.

An OKF repository is organized into a Knowledge Bundle—a self-contained, hierarchical collection of knowledge documents representing concepts. A concept can describe anything: a database table, an API endpoint, a financial KPI metric, or an operational playbook.

Sample Knowledge Bundle Structure

Let’s look at the /bundle directory structure for our sample retail analytics system:

bundle/
├── index.md                      # Bundle root listing/TOC
├── tables/
│   ├── sales_events.md          # Database Table: point-of-sale stream
│   ├── products.md              # Database Table: product dimension
│   └── customers.md             # Database Table: anonymized customers
├── metrics/
│   ├── daily_revenue.md         # Metric: formula and grain
│   └── customer_ltv.md          # Metric: predictive LTV model
└── playbooks/
    └── revenue_drop_investigation.md # Playbook: step-by-step incident response
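As a minimal sketch of how this layout maps to identifiers, the snippet below recreates the tree with empty placeholder files in a temporary directory and derives one ID per document from its relative path minus the .md suffix, which is the convention the explorer code later in this chapter uses. The helper names build_sample_bundle and list_concept_ids are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path

# Sample layout mirroring the tree above (file contents omitted).
SAMPLE_FILES = [
    "index.md",
    "tables/sales_events.md",
    "tables/products.md",
    "tables/customers.md",
    "metrics/daily_revenue.md",
    "metrics/customer_ltv.md",
    "playbooks/revenue_drop_investigation.md",
]


def build_sample_bundle(root: Path) -> None:
    """Create empty placeholder files so the layout can be explored."""
    for rel in SAMPLE_FILES:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("", encoding="utf-8")


def list_concept_ids(root: Path) -> list:
    """Return sorted concept IDs: relative paths without the .md suffix."""
    ids = []
    for md_path in root.rglob("*.md"):
        if md_path.name == "index.md":  # the bundle root listing is not a concept
            continue
        ids.append(md_path.relative_to(root).with_suffix("").as_posix())
    return sorted(ids)


with tempfile.TemporaryDirectory() as tmp:
    bundle_root = Path(tmp) / "bundle"
    build_sample_bundle(bundle_root)
    concept_ids = list_concept_ids(bundle_root)

for concept_id in concept_ids:
    print(concept_id)
```

Because the ID is just a path, renaming or moving a file changes its ID, which is worth keeping in mind when documents link to each other.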

An OKF Concept Document Example

Here is the markdown content of bundle/metrics/daily_revenue.md. Notice the metadata block (YAML frontmatter) separated by --- lines:

---
type: Metric
title: Daily Revenue
description: Total net revenue (after discounts, excluding returns) aggregated per calendar day per store.
resource: bigquery://retail-analytics/metrics/daily_revenue
tags: [revenue, daily, finance, KPI]
timestamp: 2026-06-10T00:00:00Z
owner: finance-analytics@example.com
sla: available by 03:00 UTC each morning
---

# Daily Revenue

**Daily Revenue** is the primary top-line financial KPI for the retail platform.
It answers the question: *"How much did we sell today?"*

## Definition

```sql
daily_revenue =
  SUM(
    sales_events.quantity
    * sales_events.unit_price_usd
    * (1 - sales_events.discount_pct / 100)
  )
WHERE
  sales_events.quantity > 0          -- exclude returns
GROUP BY DATE(sales_events.event_ts), sales_events.store_id
```

## Grain

One row per `(date, store_id)`.

## Source Tables

* [tables/sales_events](../tables/sales_events.md) — primary fact source

## Important Caveats

* Returns (negative `quantity`) are **excluded**.
* If `daily_revenue` drops sharply, check the [revenue drop investigation playbook](../playbooks/revenue_drop_investigation.md).
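To make the metric definition in this document concrete, here is a small worked example in plain Python. The rows are invented for illustration; only the column names and the formula come from the document above. It is a sketch of the arithmetic, not the query the real pipeline runs.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows; column names follow the metric definition above.
sales_events = [
    {"event_ts": "2026-06-09T10:15:00", "store_id": "S1",
     "quantity": 2, "unit_price_usd": 10.0, "discount_pct": 0},
    {"event_ts": "2026-06-09T11:30:00", "store_id": "S1",
     "quantity": 1, "unit_price_usd": 20.0, "discount_pct": 25},
    {"event_ts": "2026-06-09T12:00:00", "store_id": "S1",
     "quantity": -1, "unit_price_usd": 10.0, "discount_pct": 0},  # a return
    {"event_ts": "2026-06-09T09:00:00", "store_id": "S2",
     "quantity": 3, "unit_price_usd": 5.0, "discount_pct": 10},
]


def daily_revenue(rows):
    """Sum net revenue per (date, store_id), skipping returns."""
    totals = defaultdict(float)
    for row in rows:
        if row["quantity"] <= 0:  # exclude returns
            continue
        day = datetime.fromisoformat(row["event_ts"]).date().isoformat()
        net = (row["quantity"] * row["unit_price_usd"]
               * (1 - row["discount_pct"] / 100))
        totals[(day, row["store_id"])] += net
    return dict(totals)


revenue = daily_revenue(sales_events)
for (day, store_id), amount in sorted(revenue.items()):
    print(day, store_id, round(amount, 2))
```

The result has one entry per (date, store_id) pair, which matches the grain stated in the document, and the returned item contributes nothing to store S1.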

Python Architecture: The OKF Explorer

To query and browse this knowledge bundle, we wrote a Python program that acts as a consumption agent. It reads the folder tree, parses frontmatter metadata and markdown bodies, builds a local keyword index, and feeds relevant knowledge concepts to a local Ollama model to answer natural language questions.

We define the dependencies in pyproject.toml using uv:

[project]
name = "openknowledge-format"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "ollama>=0.6.1",
]

The OKF Parser and Loader

Below is the code in okf_explorer.py that recursively traverses the bundle, parses the YAML blocks, and structures them into Concept objects:

import re
from pathlib import Path
from dataclasses import dataclass, field

RESERVED_FILENAMES = {"index.md", "log.md"}

@dataclass
class Concept:
    concept_id: str          # e.g., "tables/sales_events"
    path: Path               # absolute filesystem path
    frontmatter: dict        # parsed metadata dictionary
    body: str                # raw markdown content

    @property
    def type(self) -> str:
        return self.frontmatter.get("type", "Unknown")

    @property
    def title(self) -> str:
        return self.frontmatter.get("title", self.concept_id)

    @property
    def description(self) -> str:
        return self.frontmatter.get("description", "")

    @property
    def tags(self) -> list[str]:
        return self.frontmatter.get("tags", [])

    def as_context_block(self) -> str:
        """Serializes the concept into a format optimized for LLM context."""
        lines = [
            f"## Concept: {self.title}",
            f"**ID**: {self.concept_id}",
            f"**Type**: {self.type}",
            f"**Description**: {self.description}",
        ]
        if self.tags:
            lines.append(f"**Tags**: {', '.join(self.tags)}")
        lines.append("")
        lines.append(self.body.strip())
        return "\n".join(lines)


def _parse_simple_yaml(yaml_text: str) -> dict:
    """Minimal inline YAML parser to extract frontmatter keys and arrays."""
    result = {}
    for line in yaml_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, rest = line.partition(":")
        key = key.strip()
        rest = rest.strip()
        # Handle inline lists: [tag1, tag2]
        if rest.startswith("[") and rest.endswith("]"):
            items = [i.strip().strip('"').strip("'") for i in rest[1:-1].split(",")]
            result[key] = [i for i in items if i]
        else:
            result[key] = rest.strip('"').strip("'")
    return result


def parse_concept(path: Path, bundle_root: Path) -> Concept | None:
    """Parses an individual OKF document, separating frontmatter and body."""
    if path.name in RESERVED_FILENAMES:
        return None

    text = path.read_text(encoding="utf-8")
    frontmatter = {}
    body = text

    # Locate frontmatter blocks bounded by --- lines
    fm_match = re.match(r"^---\s*\n(.*?)\n---\s*\n", text, re.DOTALL)
    if fm_match:
        frontmatter = _parse_simple_yaml(fm_match.group(1))
        body = text[fm_match.end():]

    concept_id = str(path.relative_to(bundle_root).with_suffix(""))
    concept_id = concept_id.replace("\\", "/")  # Normalize Windows paths

    return Concept(concept_id, path, frontmatter, body)
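To see what the frontmatter regular expression in parse_concept actually captures, the following standalone sketch applies the same pattern to two short strings, one with a metadata block and one without. The split_frontmatter helper is a hypothetical name used only for this illustration.

```python
import re

# Same pattern as in parse_concept: a leading '---' line, the shortest
# possible block of text, then a closing '---' line.
FRONTMATTER_RE = re.compile(r"^---\s*\n(.*?)\n---\s*\n", re.DOTALL)


def split_frontmatter(text: str):
    """Return (frontmatter_text, body); frontmatter_text is '' when absent."""
    match = FRONTMATTER_RE.match(text)
    if not match:
        return "", text
    return match.group(1), text[match.end():]


with_fm = "---\ntype: Metric\ntitle: Daily Revenue\n---\n\n# Daily Revenue\n"
without_fm = "# Notes\n\nNo metadata here.\n"

fm_text, body = split_frontmatter(with_fm)
print(repr(fm_text))
print(repr(body))
print(split_frontmatter(without_fm))
```

One detail worth noticing: because the closing `\s*` is greedy, blank lines directly after the closing --- are consumed by the match, so the body starts at the first non-blank line. A document without frontmatter is returned unchanged as the body.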

We load the complete bundle into a KnowledgeBundle class, which offers simple, database-free search helpers:

@dataclass
class KnowledgeBundle:
    root: Path
    concepts: list[Concept] = field(default_factory=list)

    @classmethod
    def load(cls, root: Path) -> "KnowledgeBundle":
        bundle = cls(root=root)
        for md_path in sorted(root.rglob("*.md")):
            concept = parse_concept(md_path, root)
            if concept:
                bundle.concepts.append(concept)
        return bundle

    def search(self, query: str) -> list[Concept]:
        """Rank concepts by keyword match frequency inside title, desc, and body."""
        keywords = [w.lower() for w in query.split() if len(w) > 2]
        scored = []
        for concept in self.concepts:
            haystack = (concept.title + " " + concept.description + " " + concept.body).lower()
            score = sum(haystack.count(kw) for kw in keywords)
            if score > 0:
                scored.append((score, concept))
        scored.sort(key=lambda x: x[0], reverse=True)
        return [c for _, c in scored]
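The sample run later in this chapter also filters concepts by type and by tag, but those helpers are not shown in the listing above. The sketch below is one plausible way to write them, not the code from okf_explorer.py: the function names, the case-insensitive matching, the MiniConcept stand-in class, and the tags on every concept except daily_revenue are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MiniConcept:
    """Stand-in for Concept with only the fields the filters need."""
    concept_id: str
    type: str
    tags: list = field(default_factory=list)


def filter_by_type(concepts, concept_type):
    """Keep concepts whose type matches, ignoring case (an assumption)."""
    wanted = concept_type.lower()
    return [c for c in concepts if c.type.lower() == wanted]


def filter_by_tag(concepts, tag):
    """Keep concepts carrying the tag, ignoring case (an assumption)."""
    wanted = tag.lower()
    return [c for c in concepts if wanted in (t.lower() for t in c.tags)]


concepts = [
    MiniConcept("metrics/customer_ltv", "Metric", ["ltv", "KPI"]),  # tags invented
    MiniConcept("metrics/daily_revenue", "Metric",
                ["revenue", "daily", "finance", "KPI"]),
    MiniConcept("tables/sales_events", "Database Table", ["sales"]),  # tags invented
]

metrics = [c.concept_id for c in filter_by_type(concepts, "Metric")]
kpis = [c.concept_id for c in filter_by_tag(concepts, "kpi")]
print(metrics)
print(kpis)
```

Since both filters are plain list comprehensions over already-loaded objects, they stay consistent with the database-free design of KnowledgeBundle.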

Our search method is a simple bag-of-words approach: it counts how often each query keyword occurs as a substring of a concept's title, description, and body. For a production system I would use a tool like zvec, as I did in the chapter “RAG Using zvec Vector Datastore and Local Model” of my Ollama in Action book (link to read online).
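As a small step between raw substring counts and a real vector datastore, the sketch below ranks documents by cosine similarity over term-frequency vectors. To be clear about what this is: it uses word counts rather than learned embeddings, it does not use zvec, and the document texts are shortened, invented stand-ins for the bundle's concepts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return document IDs with a non-zero score, best match first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id)
              for doc_id, text in documents.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


# Hypothetical, shortened concept texts.
documents = {
    "metrics/daily_revenue":
        "total net revenue aggregated per calendar day per store daily revenue",
    "tables/products":
        "product catalog with sku attributes and category hierarchy",
    "playbooks/revenue_drop_investigation":
        "guide to diagnose an unexpected drop in revenue",
}

ranked = rank("daily revenue by store", documents)
print(ranked)
```

Unlike the raw count in KnowledgeBundle.search, cosine similarity normalizes by document length, so a long document does not win simply because it repeats a keyword more often. It still cannot match synonyms, which is the gap embeddings are meant to close.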

The LLM Consumption Agent

To build a consumption agent, we wrap the search catalog and hook it up to the local Ollama service. We prompt the model to restrict its answers to the contexts provided, requiring it to cite the concept ID:

import ollama

class OKFAgent:
    SYSTEM_PROMPT = """\
You are a data knowledge assistant. You have been given excerpts from
an Open Knowledge Format (OKF) knowledge bundle describing data tables,
metrics, and operational playbooks.

Answer the user's question using ONLY the provided knowledge context.
Be concise, accurate, and cite the concept ID (e.g., tables/sales_events)
when referring to specific assets. If the context does not contain the
information, state that clearly instead of guessing.
"""

    def __init__(self, bundle: KnowledgeBundle, model: str = "gemma4:e2b-it-qat"):
        self.bundle = bundle
        self.model = model

    def _build_context(self, query: str, top_k: int = 3) -> str:
        relevant = self.bundle.search(query)[:top_k]
        if not relevant:
            relevant = self.bundle.concepts[:top_k]
        return "\n\n---\n\n".join(c.as_context_block() for c in relevant)

    def ask(self, question: str) -> str:
        context = self._build_context(question)
        user_message = f"## Knowledge Context\n\n{context}\n\n---\n\n## Question\n\n{question}"

        response = ollama.chat(
            model=self.model,
            messages=[
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": user_message},
            ]
        )
        return response.message.content
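When debugging an agent like this, it helps to look at the exact text the model receives before involving Ollama at all. The sketch below reproduces only the prompt assembly from ask and _build_context, using two short, hand-written context blocks in place of real concepts. The build_user_message function is a hypothetical helper for illustration.

```python
def build_user_message(question, context_blocks):
    """Join context blocks and the question the same way OKFAgent.ask does."""
    context = "\n\n---\n\n".join(context_blocks)
    return f"## Knowledge Context\n\n{context}\n\n---\n\n## Question\n\n{question}"


# Hand-written stand-ins for Concept.as_context_block() output.
blocks = [
    "## Concept: Daily Revenue\n**ID**: metrics/daily_revenue\n**Type**: Metric",
    "## Concept: Sales Events\n**ID**: tables/sales_events\n**Type**: Database Table",
]
question = "How is daily revenue calculated?"

message = build_user_message(question, blocks)
print(message)
```

Printing the message this way makes it easy to check that the concept IDs the system prompt asks the model to cite are actually present in the context.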

Dear reader, this example code is meant to get you started: hack away or vibe code your own applications.

Example Output

Here we run the example program, which contains several test queries:

$ uv run okf_explorer.py

======================================================================
  Loading OKF Knowledge Bundle
======================================================================
Bundle root : /Users/markwatson/GITHUB/PythonAIBook/source-code/openknowledge_format/bundle
Contents    : 6 concepts (3× Database Table, 2× Metric, 1× Playbook)

======================================================================
  All Concepts in Bundle
======================================================================
  [Metric            ]  metrics/customer_ltv
                        Predicted total net revenue a customer will generate over their entire r
  [Metric            ]  metrics/daily_revenue
                        Total net revenue (after discounts, excluding returns) aggregated per ca
  [Playbook          ]  playbooks/revenue_drop_investigation
                        Step-by-step guide for on-call analysts to diagnose an unexpected drop i
  [Database Table    ]  tables/customers
                        Anonymized customer dimension with loyalty tier and demographic segment
  [Database Table    ]  tables/products
                        Product catalog containing SKU-level attributes, category hierarchy, and
  [Database Table    ]  tables/sales_events
                        Raw point-of-sale event stream capturing every transaction at the regist

======================================================================
  Search: 'revenue'
======================================================================
  metrics/daily_revenue  —  Total net revenue (after discounts, excluding returns) aggre
  playbooks/revenue_drop_investigation  —  Step-by-step guide for on-call analysts to diagnose an unexp
  tables/sales_events  —  Raw point-of-sale event stream capturing every transaction a

======================================================================
  Filter by type: 'Metric'
======================================================================
  metrics/customer_ltv  —  Customer Lifetime Value (LTV)
  metrics/daily_revenue  —  Daily Revenue

======================================================================
  Filter by tag: 'KPI'
======================================================================
  metrics/customer_ltv  —  Customer Lifetime Value (LTV)
  metrics/daily_revenue  —  Daily Revenue

======================================================================
  LLM Q&A  (model: gemma4:e2b-it-qat)
======================================================================

Q1: How is daily revenue calculated and what tables does it use?
------------------------------------------------------------
Daily revenue is calculated as follows:

$$
\text{daily\_revenue} = \sum (\text{sales\_events.quantity} \times
    \text{sales\_events.unit\_price\_usd} \times (1 -
    \frac{\text{sales\_events.discount\_pct}}{100}))
$$

The calculation excludes returns by filtering for
    $\text{sales\_events.quantity} > 0$. The result is aggregated per
    calendar day and per store ID
    ($\text{DATE(sales\_events.event\_ts)},
    \text{sales\_events.store\_id}$).

**Source Tables:**
The primary source table used for this calculation is
    **[tables/sales_events](../tables/sales_events.md)**.

Q2: What should I do if daily revenue drops suddenly?
------------------------------------------------------------
If daily revenue drops suddenly, you should follow the **Revenue Drop
    Investigation Playbook** (#playbooks/revenue_drop_investigation).
    This investigation is triggered when the
    [daily_revenue](../metrics/daily_revenue.md) metric shows a
    significant unexpected decline (> 10 % day-over-day or > 2 σ below
    the 30-day rolling average).

### Step 1 — Confirm the drop is real (not a data issue)
1. Check the pipeline run log to ensure the nightly aggregation
    completed successfully.
2. Verify row counts in [sales_events](../tables/sales_events.md) for
    the affected day. A count near zero typically indicates a pipeline
    or CDC failure rather than a business drop.

### Step 2 — Isolate by dimension
Run the revenue query broken down by:
*   **Store**: To see if the drop is isolated to one location.
*   **Category**: To check if a specific product category is affected.
*   **Payment method**: To identify shifts in payment mix caused by
    processor issues.
*   **Hour of day**: To detect mid-day system outage windows.

### Step 3 — Check for external signals
1. Review the incident board for ongoing POS system outages.
2. Check weather and calendar (holidays or local events) for affected
    stores.
3. Consult the merchandising team to see if planned promotions have
    ended.

### Step 4 — Escalate if needed
If the root cause is not identified within 30 minutes, escalate based on
    the symptom:

*   **Pipeline / data quality**: Page `#data-engineering-oncall`.
*   **POS system outage**: Page `#it-ops-oncall`.
*   **Payment processor**: Page `#payments-oncall`.
*   **Genuine business drop**: Notify the `VP of Retail` via the
    standard business escalation path.

Q3: What percentage of sales events have a customer ID? And what does that tell us about LTV calculations?
------------------------------------------------------------
Based on the provided documentation for **Sales Events**:

1.  **Percentage of Sales Events with a Customer ID:**
    Roughly 38% of transactions have a NULL `customer_id`, which
    represents anonymous cash sales. This means only approximately 62%
    of rows in the `sales_events` table contain a non-NULL identifier
    for a specific customer.

2.  **Impact on LTV Calculations:**
    The **Customer Lifetime Value (LTV)** metric is computed using
    historical purchase sequences from the `sales_events` table via
    machine learning models (BG/NBD and Gamma-Gamma) that rely on the
    `customer_id`. This allows for a forward-looking estimate of the
    total net revenue a customer will generate, linking transactional
    data to specific customers.

Q4: How do I join sales_events to products correctly for historical reports?
------------------------------------------------------------
To join `sales_events` to `products` for historical reports, you must
    join on `product_id` and filter using a condition that matches the
    event timestamp against the product's validity period:

*   Join `sales_events` and `products` on `product_id`.
*   Filter by: `valid_from <= event_ts::date AND (valid_to IS NULL OR
    valid_to >= event_ts::date)` to retrieve the correct historical
    version of the product.

Wrap Up

The Open Knowledge Format represents a pragmatic bridge between two paradigms: traditional, human-focused documentation, and structured, API-centric schema specifications. Because it defaults to plain-text markdown in Git, engineers do not have to leave their development workflows to document their systems, and automated pipelines can generate or read metadata files directly.

Catalog of Uses for OKF Bundles

AI Copilots & RAG Context: When internal documentation is formatted as OKF, agents can parse metadata and trace lineage relationships using ordinary file and folder traversal. Because each concept is small, it fits comfortably into an LLM context window, which keeps token consumption low and can reduce hallucinations by grounding answers in curated text.
Interactive Developer Wikis: Static site generators (like Hugo, Docusaurus, or MkDocs) can read the /bundle directory directly to render human-navigable wikis. Humans read the markdown in a portal, while consumption agents process the exact same files programmatically.
Data Lineage and Operational Audits: By adding timestamps, ownership keys, and pipeline run logs to the frontmatter of table and metric documents, operations teams can quickly trace data dependencies during outages.
On-call Automation: Incident systems can search the bundle for playbooks tagged with relevant metrics (e.g. searching for daily_revenue matches playbooks/revenue_drop_investigation.md) and automatically attach diagnostic instructions to incoming pager alerts.

Reader Projects and Exercises

Project 1: Automatic OKF Generators. Write a Python pipeline script that inspects a PostgreSQL or BigQuery schema, extracts the column names and comments, and automatically generates or updates the frontmatter and schema tables in bundle/tables/<table_name>.md.
Project 2: Vector Search for OKF Bundles. Replace the simple substring-matching index in KnowledgeBundle.search() with a vector database. Write a script to generate text embeddings for each concept using an Ollama embedding model (like nomic-embed-text) and perform semantic retrieval instead of keyword search.
Project 3: Git Hook Validator. Build a pre-commit Git hook that validates OKF bundles. The hook should check that every markdown file contains valid YAML frontmatter with the required type field, and that any links to other concepts ([text](../path/to/concept.md)) point to files that actually exist in the bundle.
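As one possible starting point for Project 3, the sketch below covers only the frontmatter half of the check: it reports documents that lack a frontmatter block or a type key. The function names are hypothetical, the key detection is deliberately naive (it only looks at the text before the first colon on each line), and wiring the result into an actual Git hook is left to you.

```python
import re
from pathlib import Path

FRONTMATTER_RE = re.compile(r"^---\s*\n(.*?)\n---\s*\n", re.DOTALL)
RESERVED_FILENAMES = {"index.md", "log.md"}  # same reserved names as okf_explorer.py


def validate_text(text: str) -> list:
    """Return a list of problems found in one document's text."""
    match = FRONTMATTER_RE.match(text)
    if not match:
        return ["missing frontmatter block"]
    keys = {
        line.partition(":")[0].strip()
        for line in match.group(1).splitlines()
        if ":" in line
    }
    if "type" not in keys:
        return ["missing required 'type' field"]
    return []


def validate_bundle(root: Path) -> list:
    """Validate every non-reserved markdown file under root."""
    problems = []
    for md_path in sorted(root.rglob("*.md")):
        if md_path.name in RESERVED_FILENAMES:
            continue
        for problem in validate_text(md_path.read_text(encoding="utf-8")):
            problems.append(f"{md_path}: {problem}")
    return problems


good = "---\ntype: Metric\ntitle: Daily Revenue\n---\n# Daily Revenue\n"
no_type = "---\ntitle: Daily Revenue\n---\n# Daily Revenue\n"
no_frontmatter = "# Daily Revenue\n"

print(validate_text(good))
print(validate_text(no_type))
print(validate_text(no_frontmatter))
```

A hook script would call validate_bundle on the bundle directory, print any problems, and exit with a non-zero status so that Git aborts the commit.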

Optional Practice Problems

Here are some optional practice problems to help you master the concepts covered in this chapter:

1. Warm-Up: Robust Frontmatter Parser (Easy)

The custom parser _parse_simple_yaml in okf_explorer.py is lightweight and requires no external libraries, but it cannot handle nested YAML dictionaries or multiline values.

Task: Enhance the parser in one of the following ways:
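1. Use a full YAML library such as PyYAML (yaml.safe_load), or extend the hand-written parser to support multi-line values (like descriptions spanning multiple lines). Note that the standard library's tomllib parses TOML, not YAML, so it cannot read this frontmatter directly.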
2. Implement automatic type coercion. For example, if a key is timestamp, parse the value into a Python datetime object. If a key is sla, strip whitespace and standardize its casing.
3. Write a small test suite in Python to verify your parser correctly handles edge cases, such as values that contain colons (e.g. sla: available by 03:00 UTC).
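To get you started on tasks 2 and 3, here is a minimal sketch of line parsing plus timestamp coercion, with two assertions that exercise the colon edge case. The parse_line and coerce helpers are hypothetical names, and the coercion rule covers only the timestamp key.

```python
from datetime import datetime, timezone


def parse_line(line: str):
    """Split 'key: value' on the first colon only."""
    key, _, rest = line.partition(":")
    return key.strip(), rest.strip()


def coerce(key: str, value: str):
    """Convert known keys to richer Python types; leave the rest as strings."""
    if key == "timestamp":
        # Older Python versions reject a trailing 'Z' in fromisoformat,
        # so rewrite it as an explicit UTC offset first.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    return value


key, value = parse_line("sla: available by 03:00 UTC each morning")
assert key == "sla"
assert value == "available by 03:00 UTC each morning"

ts_key, ts_value = parse_line("timestamp: 2026-06-10T00:00:00Z")
parsed_ts = coerce(ts_key, ts_value)
assert parsed_ts == datetime(2026, 6, 10, tzinfo=timezone.utc)
print(key, "->", value)
print(ts_key, "->", parsed_ts.isoformat())
```

Splitting on the first colon is what keeps values such as 03:00 and the ISO timestamp intact, and a real test suite would move the assertions into pytest or unittest test functions.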

2. OKF Link Integrity Checker (Medium)

In an OKF bundle, documentation files link to each other using relative markdown links (e.g., [sales_events](../tables/sales_events.md)). If a file name is changed or deleted, these references break.

Task: Write a validation utility in okf_explorer.py that checks the link integrity of the entire bundle:
1. Extend KnowledgeBundle to parse the markdown body of each concept and locate all Markdown-style links: [link text](relative_path).
2. Resolve the relative_path relative to the current concept’s filesystem location.
3. Check if the file at the resolved path exists on disk.
4. Collect all broken links and output them as a structured report (showing the source file, line number, and broken path).
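The four steps above can be sketched roughly as follows. This version is a standalone function rather than a KnowledgeBundle method, it scans whole files (frontmatter included) so that line numbers match what an editor shows, and it skips external URLs and in-page anchors. The sample files it creates in a temporary directory are invented for the demonstration.

```python
import re
import tempfile
from pathlib import Path

# Matches [link text](target) where target contains no spaces.
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def find_broken_links(bundle_root: Path) -> list:
    """Return (source_file, line_number, target) for each unresolved link."""
    broken = []
    for md_path in sorted(bundle_root.rglob("*.md")):
        lines = md_path.read_text(encoding="utf-8").splitlines()
        for line_no, line in enumerate(lines, start=1):
            for _text, target in LINK_RE.findall(line):
                if "://" in target or target.startswith("#"):
                    continue  # external URL or in-page anchor
                # Resolve relative to the directory of the linking file.
                resolved = (md_path.parent / target.split("#")[0]).resolve()
                if not resolved.exists():
                    source = md_path.relative_to(bundle_root).as_posix()
                    broken.append((source, line_no, target))
    return broken


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "bundle"
    (root / "metrics").mkdir(parents=True)
    (root / "tables").mkdir()
    (root / "tables" / "sales_events.md").write_text(
        "# Sales Events\n", encoding="utf-8")
    (root / "metrics" / "daily_revenue.md").write_text(
        "# Daily Revenue\n"
        "* [tables/sales_events](../tables/sales_events.md)\n"
        "* [playbook](../playbooks/missing.md)\n",
        encoding="utf-8",
    )
    report = find_broken_links(root)

for source, line_no, target in report:
    print(f"{source}:{line_no}: broken link -> {target}")
```

The tuples in the report already carry the source file, line number, and broken path, so turning them into JSON or a table for a CI job is a small extra step.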

3. Graph-Augmented RAG Retrieval (Hard)

The standard OKFAgent._build_context retrieves the top-scoring concepts independently based on keywords. However, data concepts are highly relational: a metric like daily_revenue references tables/sales_events as its source table and also links to playbooks/revenue_drop_investigation.

Task: Enhance the retrieval agent to support graph-based context traversal:
1. Build a dependency graph where each node is a Concept and directed edges represent references (parsed from Markdown links or specified in frontmatter metadata).
2. When a user asks a question, retrieve the initial top_k concepts using keyword search.
3. Automatically traverse the graph to retrieve all directly connected neighbors (1-hop relation) of these concepts.
4. Combine the initial concepts and their neighbors, de-duplicate them, and construct the prompt context.
5. Test this implementation by asking: “What playbook should I run if the main metric depending on sales_events fails?” Verify that the agent successfully pulls the revenue_drop_investigation playbook even if the playbook itself doesn’t contain the keyword “sales_events”.
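A minimal sketch of steps 1, 3, and 4 is shown below, working on concept IDs only. It treats references as undirected when expanding, which is a design assumption: a table then also pulls in the metrics and playbooks that link to it. The edge list is partly hypothetical; the daily_revenue links come from the sample document earlier in this chapter, while the playbook's link is assumed.

```python
from collections import defaultdict


def build_undirected(edges: dict) -> dict:
    """Turn {source: [targets]} into a symmetric neighbor map."""
    neighbors = defaultdict(set)
    for source, targets in edges.items():
        for target in targets:
            neighbors[source].add(target)
            neighbors[target].add(source)
    return neighbors


def expand_one_hop(seed_ids: list, neighbors: dict) -> list:
    """Return seeds followed by their direct neighbors, without duplicates."""
    ordered = list(dict.fromkeys(seed_ids))  # de-duplicate seeds, keep order
    seen = set(ordered)
    for seed in list(ordered):
        for neighbor in sorted(neighbors.get(seed, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                ordered.append(neighbor)
    return ordered


# Directed references parsed from Markdown links (playbook entry is assumed).
edges = {
    "metrics/daily_revenue": [
        "tables/sales_events",
        "playbooks/revenue_drop_investigation",
    ],
    "playbooks/revenue_drop_investigation": ["tables/sales_events"],
}

neighbors = build_undirected(edges)
context_ids = expand_one_hop(["tables/sales_events"], neighbors)
print(context_ids)
```

In the agent, the seed IDs would come from the existing keyword search, and the expanded list would be mapped back to Concept objects before calling as_context_block. Keep an eye on context size, since a heavily linked table can pull in many neighbors.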