Open Knowledge Format (OKF) for Human-Agent Systems
As artificial intelligence systems and autonomous agents play an increasingly central role in data analytics, a new challenge has emerged: how do we share context, curated metadata, and operational playbooks between humans and AI systems? Historically, metadata has been stored in specialized database catalog systems (like Apache Atlas, Google Cloud Dataplex, or proprietary enterprise wikis) that are difficult for agents to access without custom integrations, or formatted as raw JSON/YAML payloads that are dry and hard for humans to collaborate on.
Dear reader, in this chapter we explore the Open Knowledge Format (OKF): a minimal, lightweight, human- and agent-friendly convention for representing knowledge surrounding data systems. Originally proposed in draft specs by Google Cloud Platform, OKF takes the position that knowledge should be readable, writable, diffable (changes show up cleanly in tools like git diff), and portable using nothing more than Markdown files, YAML frontmatter, and standard version control (such as Git).
We first focus on creating a simple Python implementation of a system using OKF. In the final Wrap Up section, we catalog the diverse ways you can use this example code to build knowledge bases designed to serve both humans and AI systems, alongside practical project ideas for you to build upon.
The examples for this chapter are located in the directory source-code/openknowledge_format.
References & Inspiration
This implementation is based on the concepts and drafts defined by Google:
- Concept Blog Post: How the Open Knowledge Format can improve data sharing
- Format Specification: Google Cloud Platform Knowledge Catalog - OKF Specification
Note: This is a clean-room implementation of the OKF conceptual specification and does not use Google’s proprietary libraries.
What is Open Knowledge Format (OKF)?
According to the open specification draft, the format is intentionally minimal:
- Human-Readable: It consists of standard Markdown documents that can be read directly in a terminal, text editor, or rendered in a browser.
- Agent-Parseable: The top of each document contains a YAML frontmatter block that standardizes metadata attributes (such as the asset type, canonical URI, and classification tags).
- Diffable: Because it consists of plain-text markdown, changes can be reviewed, tracked, and merged using standard version control like Git.
- Portable: There is no schema registry, database, or SDK dependency. If you can git clone or zip a directory, you can share an OKF knowledge bundle.
An OKF repository is organized into a Knowledge Bundle—a self-contained, hierarchical collection of knowledge documents representing concepts. A concept can describe anything: a database table, an API endpoint, a financial KPI metric, or an operational playbook.
Sample Knowledge Bundle Structure
Let’s look at the /bundle directory structure for our sample retail analytics system:
bundle/
├── index.md                            # Bundle root listing/TOC
├── tables/
│   ├── sales_events.md                 # Database Table: point-of-sale stream
│   ├── products.md                     # Database Table: product dimension
│   └── customers.md                    # Database Table: anonymized customers
├── metrics/
│   ├── daily_revenue.md                # Metric: formula and grain
│   └── customer_ltv.md                 # Metric: predictive LTV model
└── playbooks/
    └── revenue_drop_investigation.md   # Playbook: step-by-step incident response
An OKF Concept Document Example
Here is the markdown content of bundle/metrics/daily_revenue.md. Notice the metadata block (YAML frontmatter) separated by --- lines:
---
type: Metric
title: Daily Revenue
description: Total net revenue (after discounts, excluding returns) aggregated per calendar day per store.
resource: bigquery://retail-analytics/metrics/daily_revenue
tags: [revenue, daily, finance, KPI]
timestamp: 2026-06-10T00:00:00Z
owner: finance-analytics@example.com
sla: available by 03:00 UTC each morning
---

# Daily Revenue

**Daily Revenue** is the primary top-line financial KPI for the retail platform.
It answers the question: *"How much did we sell today?"*

## Definition

```sql
daily_revenue =
  SUM(
    sales_events.quantity
    * sales_events.unit_price_usd
    * (1 - sales_events.discount_pct / 100)
  )
WHERE
  sales_events.quantity > 0  -- exclude returns
GROUP BY DATE(sales_events.event_ts), sales_events.store_id
```

## Grain

One row per `(date, store_id)`.

## Source Tables

* [tables/sales_events](../tables/sales_events.md) — primary fact source

## Important Caveats

* Returns (negative `quantity`) are **excluded**.
* If `daily_revenue` drops sharply, check the [revenue drop investigation playbook](../playbooks/revenue_drop_investigation.md).
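To make the definition concrete, here is a small worked example in Python. The rows are invented for illustration (they do not come from the sample bundle), and the function name daily_revenue is hypothetical. The sketch simply applies the formula from the concept document, skips returns, and groups by (date, store_id).

```python
from collections import defaultdict
from datetime import datetime

# Invented sample rows:
# (event_ts, store_id, quantity, unit_price_usd, discount_pct)
SAMPLE_EVENTS = [
    ("2026-06-09T09:15:00", "store-1", 2, 10.00, 0),    # 2 * 10.00          = 20.00
    ("2026-06-09T13:40:00", "store-1", 1, 50.00, 10),   # 1 * 50.00 * 0.90   = 45.00
    ("2026-06-09T14:05:00", "store-1", -1, 50.00, 10),  # return: excluded
    ("2026-06-09T10:30:00", "store-2", 3, 4.00, 25),    # 3 * 4.00 * 0.75    =  9.00
    ("2026-06-10T08:00:00", "store-1", 1, 8.00, 0),     # 1 * 8.00           =  8.00
]


def daily_revenue(events):
    """Sum net revenue per (date, store_id), ignoring rows with quantity <= 0."""
    totals = defaultdict(float)
    for event_ts, store_id, quantity, unit_price_usd, discount_pct in events:
        if quantity <= 0:
            continue  # mirrors the "quantity > 0" filter in the definition
        day = datetime.fromisoformat(event_ts).date().isoformat()
        totals[(day, store_id)] += quantity * unit_price_usd * (1 - discount_pct / 100)
    # Round to cents so float noise does not leak into the report.
    return {key: round(value, 2) for key, value in totals.items()}


if __name__ == "__main__":
    for (day, store_id), revenue in sorted(daily_revenue(SAMPLE_EVENTS).items()):
        print(day, store_id, f"{revenue:.2f}")
```

The result has one entry per (date, store_id) pair, which matches the grain stated in the concept document.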
Python Architecture: The OKF Explorer
To query and browse this knowledge bundle, we wrote a Python program that acts as a consumption agent. It reads the folder tree, parses frontmatter metadata and markdown bodies, builds a local keyword index, and feeds relevant knowledge concepts to a local Ollama model to answer natural language questions.
We define the dependencies in pyproject.toml using uv:
[project]
name = "openknowledge-format"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "ollama>=0.6.1",
]
The OKF Parser and Loader
Below is the code in okf_explorer.py that recursively traverses the bundle, parses the YAML blocks, and structures them into Concept objects:
import re
from pathlib import Path
from dataclasses import dataclass, field

# Files with a structural role in the bundle; they are not parsed as concepts.
RESERVED_FILENAMES = {"index.md", "log.md"}

@dataclass
class Concept:
    concept_id: str    # e.g., "tables/sales_events"
    path: Path         # absolute filesystem path
    frontmatter: dict  # parsed metadata dictionary
    body: str          # raw markdown content

    @property
    def type(self) -> str:
        return self.frontmatter.get("type", "Unknown")

    @property
    def title(self) -> str:
        return self.frontmatter.get("title", self.concept_id)

    @property
    def description(self) -> str:
        return self.frontmatter.get("description", "")

    @property
    def tags(self) -> list[str]:
        return self.frontmatter.get("tags", [])

    def as_context_block(self) -> str:
        """Serializes the concept into a format optimized for LLM context."""
        lines = [
            f"## Concept: {self.title}",
            f"**ID**: {self.concept_id}",
            f"**Type**: {self.type}",
            f"**Description**: {self.description}",
        ]
        if self.tags:
            lines.append(f"**Tags**: {', '.join(self.tags)}")
        lines.append("")
        lines.append(self.body.strip())
        return "\n".join(lines)


def _parse_simple_yaml(yaml_text: str) -> dict:
    """Minimal inline YAML parser to extract frontmatter keys and arrays."""
    result = {}
    for line in yaml_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, rest = line.partition(":")
        key = key.strip()
        rest = rest.strip()
        # Handle inline lists: [tag1, tag2]
        if rest.startswith("[") and rest.endswith("]"):
            items = [i.strip().strip('"').strip("'") for i in rest[1:-1].split(",")]
            result[key] = [i for i in items if i]
        else:
            result[key] = rest.strip('"').strip("'")
    return result


def parse_concept(path: Path, bundle_root: Path) -> Concept | None:
    """Parses an individual OKF document, separating frontmatter and body."""
    if path.name in RESERVED_FILENAMES:
        return None

    text = path.read_text(encoding="utf-8")
    frontmatter = {}
    body = text

    # Locate frontmatter blocks bounded by --- lines
    fm_match = re.match(r"^---\s*\n(.*?)\n---\s*\n", text, re.DOTALL)
    if fm_match:
        frontmatter = _parse_simple_yaml(fm_match.group(1))
        body = text[fm_match.end():]

    concept_id = str(path.relative_to(bundle_root).with_suffix(""))
    concept_id = concept_id.replace("\\", "/")  # Normalize Windows paths

    return Concept(concept_id, path, frontmatter, body)
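A quick way to see what this parser does with the sample frontmatter is to run the same logic on an in-memory string. The sketch below is a condensed, standalone restatement of the splitting and parsing steps; the helper name split_frontmatter is hypothetical and not part of okf_explorer.py. It shows that values containing colons survive intact, because only the first colon on a line separates key from value, and that every scalar stays a plain string.

```python
import re

# Same pattern as in the listing: a leading block bounded by --- lines.
FRONTMATTER_RE = re.compile(r"^---\s*\n(.*?)\n---\s*\n", re.DOTALL)

SAMPLE_DOC = """\
---
type: Metric
title: Daily Revenue
resource: bigquery://retail-analytics/metrics/daily_revenue
tags: [revenue, daily, finance, KPI]
sla: available by 03:00 UTC each morning
---

# Daily Revenue
"""


def split_frontmatter(text: str) -> tuple[dict, str]:
    """Return (metadata, body); metadata is empty when no frontmatter is found."""
    match = FRONTMATTER_RE.match(text)
    if not match:
        return {}, text
    metadata = {}
    for line in match.group(1).splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, rest = line.partition(":")  # split on the FIRST colon only
        key, rest = key.strip(), rest.strip()
        if rest.startswith("[") and rest.endswith("]"):
            # Inline list such as [a, b, c]
            metadata[key] = [i.strip() for i in rest[1:-1].split(",") if i.strip()]
        else:
            metadata[key] = rest.strip('"').strip("'")
    return metadata, text[match.end():]


if __name__ == "__main__":
    meta, body = split_frontmatter(SAMPLE_DOC)
    print(meta["sla"])
    print(meta["resource"])
    print(meta["tags"])
    print(body.strip())
```

Note that the timestamp and sla values come back as strings; nothing is converted to dates or numbers at this stage.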
We load the complete bundle into a KnowledgeBundle class, which offers simple, database-free search helpers:
@dataclass
class KnowledgeBundle:
    root: Path
    concepts: list[Concept] = field(default_factory=list)

    @classmethod
    def load(cls, root: Path) -> "KnowledgeBundle":
        bundle = cls(root=root)
        # Sorted traversal keeps the concept order stable across runs.
        for md_path in sorted(root.rglob("*.md")):
            concept = parse_concept(md_path, root)
            if concept:
                bundle.concepts.append(concept)
        return bundle

    def search(self, query: str) -> list[Concept]:
        """Rank concepts by keyword match frequency inside title, desc, and body."""
        keywords = [w.lower() for w in query.split() if len(w) > 2]
        scored = []
        for concept in self.concepts:
            haystack = (concept.title + " " + concept.description + " " + concept.body).lower()
            score = sum(haystack.count(kw) for kw in keywords)
            if score > 0:
                scored.append((score, concept))
        scored.sort(key=lambda x: x[0], reverse=True)
        return [c for _, c in scored]
Our search method is a simple bag-of-matching-words approach. For a production system I would use a tool like zvec, as I did in the chapter “RAG Using zvec Vector Datastore and Local Model” of my book Ollama in Action (available to read online).
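The sample run later in this chapter also prints results filtered by type and by tag, but those helpers are not part of the listing above. Here is a minimal sketch of how such filters might look. It uses a stand-in MiniConcept class and the hypothetical function names filter_by_type and filter_by_tag; the real okf_explorer.py may implement them differently, and the case-insensitive matching is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class MiniConcept:
    """Stand-in for the chapter's Concept class, holding only what filters need."""
    concept_id: str
    type: str = "Unknown"
    tags: list[str] = field(default_factory=list)


def filter_by_type(concepts: list[MiniConcept], type_name: str) -> list[MiniConcept]:
    """Keep concepts whose type equals type_name, ignoring case (an assumption)."""
    wanted = type_name.casefold()
    return [c for c in concepts if c.type.casefold() == wanted]


def filter_by_tag(concepts: list[MiniConcept], tag: str) -> list[MiniConcept]:
    """Keep concepts carrying the given tag, ignoring case (an assumption)."""
    wanted = tag.casefold()
    return [c for c in concepts if any(t.casefold() == wanted for t in c.tags)]


if __name__ == "__main__":
    concepts = [
        MiniConcept("metrics/daily_revenue", "Metric", ["revenue", "daily", "finance", "KPI"]),
        MiniConcept("metrics/customer_ltv", "Metric", ["kpi"]),            # tags invented
        MiniConcept("tables/sales_events", "Database Table", ["sales"]),   # tags invented
    ]
    print([c.concept_id for c in filter_by_type(concepts, "metric")])
    print([c.concept_id for c in filter_by_tag(concepts, "KPI")])
```

Both helpers are plain list scans, in keeping with the database-free approach of search().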
The LLM Consumption Agent
To build a consumption agent, we wrap the search catalog and hook it up to the local Ollama service. We prompt the model to restrict its answers to the contexts provided, requiring it to cite the concept ID:
import ollama

class OKFAgent:
    SYSTEM_PROMPT = """\
You are a data knowledge assistant. You have been given excerpts from
an Open Knowledge Format (OKF) knowledge bundle describing data tables,
metrics, and operational playbooks.

Answer the user's question using ONLY the provided knowledge context.
Be concise, accurate, and cite the concept ID (e.g., tables/sales_events)
when referring to specific assets. If the context does not contain the
information, state that clearly instead of guessing.
"""

    def __init__(self, bundle: KnowledgeBundle, model: str = "gemma4:e2b-it-qat"):
        self.bundle = bundle
        self.model = model

    def _build_context(self, query: str, top_k: int = 3) -> str:
        relevant = self.bundle.search(query)[:top_k]
        if not relevant:
            # No keyword hits: fall back to the first concepts in the bundle.
            relevant = self.bundle.concepts[:top_k]
        return "\n\n---\n\n".join(c.as_context_block() for c in relevant)

    def ask(self, question: str) -> str:
        context = self._build_context(question)
        user_message = f"## Knowledge Context\n\n{context}\n\n---\n\n## Question\n\n{question}"

        response = ollama.chat(
            model=self.model,
            messages=[
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": user_message},
            ]
        )
        return response.message.content
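Because ask() calls a live Ollama server, it is awkward to exercise in automated tests. One option, sketched here under assumptions, is to separate message construction from the model call and pass the chat function in as a parameter. The names build_messages, ask_with, and fake_chat are hypothetical and not part of okf_explorer.py, and the system prompt is a shortened stand-in.

```python
from typing import Callable

# Shortened stand-in for the full system prompt used by the agent.
SYSTEM_PROMPT = "Answer the user's question using ONLY the provided knowledge context."


def build_messages(context: str, question: str) -> list[dict]:
    """Assemble the system and user messages in the same layout the agent uses."""
    user_message = f"## Knowledge Context\n\n{context}\n\n---\n\n## Question\n\n{question}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]


def ask_with(chat_fn: Callable[[list[dict]], str], context: str, question: str) -> str:
    """Send the assembled messages through whichever chat function is supplied."""
    return chat_fn(build_messages(context, question))


def fake_chat(messages: list[dict]) -> str:
    """Test double: reports what it received instead of calling a model."""
    return f"received {len(messages)} messages"


if __name__ == "__main__":
    print(ask_with(fake_chat, "## Concept: Daily Revenue", "How is it calculated?"))
```

With the real client, chat_fn could be a thin wrapper around the same ollama.chat call used in the listing, returning response.message.content.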
Dear reader, this example code is meant to get you started: hack away or vibe code your own applications.
Example Output
Here we run the example program, which contains several test queries:
$ uv run okf_explorer.py

======================================================================
Loading OKF Knowledge Bundle
======================================================================
Bundle root : /Users/markwatson/GITHUB/PythonAIBook/source-code/openknowledge_format/bundle
Contents : 6 concepts (3× Database Table, 2× Metric, 1× Playbook)

======================================================================
All Concepts in Bundle
======================================================================
[Metric ] metrics/customer_ltv
    Predicted total net revenue a customer will generate over their entire r
[Metric ] metrics/daily_revenue
    Total net revenue (after discounts, excluding returns) aggregated per ca
[Playbook ] playbooks/revenue_drop_investigation
    Step-by-step guide for on-call analysts to diagnose an unexpected drop i
[Database Table ] tables/customers
    Anonymized customer dimension with loyalty tier and demographic segment
[Database Table ] tables/products
    Product catalog containing SKU-level attributes, category hierarchy, and
[Database Table ] tables/sales_events
    Raw point-of-sale event stream capturing every transaction at the regist

======================================================================
Search: 'revenue'
======================================================================
metrics/daily_revenue — Total net revenue (after discounts, excluding returns) aggre
playbooks/revenue_drop_investigation — Step-by-step guide for on-call analysts to diagnose an unexp
tables/sales_events — Raw point-of-sale event stream capturing every transaction a

======================================================================
Filter by type: 'Metric'
======================================================================
metrics/customer_ltv — Customer Lifetime Value (LTV)
metrics/daily_revenue — Daily Revenue

======================================================================
Filter by tag: 'KPI'
======================================================================
metrics/customer_ltv — Customer Lifetime Value (LTV)
metrics/daily_revenue — Daily Revenue

======================================================================
LLM Q&A (model: gemma4:e2b-it-qat)
======================================================================

Q1: How is daily revenue calculated and what tables does it use?
------------------------------------------------------------
Daily revenue is calculated as follows:

$$
\text{daily\_revenue} = \sum (\text{sales\_events.quantity} \times
\text{sales\_events.unit\_price\_usd} \times (1 -
\frac{\text{sales\_events.discount\_pct}}{100}))
$$

The calculation excludes returns by filtering for
$\text{sales\_events.quantity} > 0$. The result is aggregated per
calendar day and per store ID
($\text{DATE(sales\_events.event\_ts)},
\text{sales\_events.store\_id}$).

**Source Tables:**
The primary source table used for this calculation is
**[tables/sales_events](../tables/sales_events.md)**.

Q2: What should I do if daily revenue drops suddenly?
------------------------------------------------------------
If daily revenue drops suddenly, you should follow the **Revenue Drop
Investigation Playbook** (#playbooks/revenue_drop_investigation).
This investigation is triggered when the
[daily_revenue](../metrics/daily_revenue.md) metric shows a
significant unexpected decline (> 10 % day-over-day or > 2 σ below
the 30-day rolling average).

### Step 1 — Confirm the drop is real (not a data issue)
1. Check the pipeline run log to ensure the nightly aggregation
   completed successfully.
2. Verify row counts in [sales_events](../tables/sales_events.md) for
   the affected day. A count near zero typically indicates a pipeline
   or CDC failure rather than a business drop.

### Step 2 — Isolate by dimension
Run the revenue query broken down by:
* **Store**: To see if the drop is isolated to one location.
* **Category**: To check if a specific product category is affected.
* **Payment method**: To identify shifts in payment mix caused by
  processor issues.
* **Hour of day**: To detect mid-day system outage windows.

### Step 3 — Check for external signals
1. Review the incident board for ongoing POS system outages.
2. Check weather and calendar (holidays or local events) for affected
   stores.
3. Consult the merchandising team to see if planned promotions have
   ended.

### Step 4 — Escalate if needed
If the root cause is not identified within 30 minutes, escalate based on
the symptom:

* **Pipeline / data quality**: Page `#data-engineering-oncall`.
* **POS system outage**: Page `#it-ops-oncall`.
* **Payment processor**: Page `#payments-oncall`.
* **Genuine business drop**: Notify the `VP of Retail` via the
  standard business escalation path.

Q3: What percentage of sales events have a customer ID? And what does that tell us about LTV calculations?
------------------------------------------------------------
Based on the provided documentation for **Sales Events**:

1. **Percentage of Sales Events with a Customer ID:**
   Roughly 38% of transactions have a NULL `customer_id`, which
   represents anonymous cash sales. This means only approximately 62%
   of rows in the `sales_events` table contain a non-NULL identifier
   for a specific customer.

2. **Impact on LTV Calculations:**
   The **Customer Lifetime Value (LTV)** metric is computed using
   historical purchase sequences from the `sales_events` table via
   machine learning models (BG/NBD and Gamma-Gamma) that rely on the
   `customer_id`. This allows for a forward-looking estimate of the
   total net revenue a customer will generate, linking transactional
   data to specific customers.

Q4: How do I join sales_events to products correctly for historical reports?
------------------------------------------------------------
To join `sales_events` to `products` for historical reports, you must
join on `product_id` and filter using a condition that matches the
event timestamp against the product's validity period:

* Join `sales_events` and `products` on `product_id`.
* Filter by: `valid_from <= event_ts::date AND (valid_to IS NULL OR
  valid_to >= event_ts::date)` to retrieve the correct historical
  version of the product.
Wrap Up
The Open Knowledge Format represents a pragmatic bridge between two paradigms: traditional, human-focused documentation, and structural, API-centric schema specifications. Because it defaults to raw text markdown in Git, engineers do not have to leave their development workflows to document their systems, and automated pipelines can generate or read metadata files directly.
Catalog of Uses for OKF Bundles
- AI Copilots & RAG Context: By formatting internal documentation in OKF, agents can easily parse metadata and trace lineage relationships using standard folder parsing. Because each concept is small, it fits comfortably into LLM context windows, which keeps token consumption down and can reduce the risk of hallucination.
- Interactive Developer Wikis: Static site generators (like Hugo, Docusaurus, or MkDocs) can read the /bundle directory directly to render human-navigable wikis. Humans read the markdown in a portal, while consumption agents process the exact same files programmatically.
- Data Lineage and Operational Audits: By adding timestamps, ownership keys, and pipeline run logs to the frontmatter of table and metric documents, operations teams can quickly trace data dependencies during outages.
- On-call Automation: Incident systems can search the bundle for playbooks tagged with relevant metrics (e.g. searching for daily_revenue matches playbooks/revenue_drop_investigation.md) and automatically attach diagnostic instructions to incoming pager alerts.
Reader Projects and Exercises
- Project 1: Automatic OKF Generators. Write a Python pipeline script that inspects a PostgreSQL or BigQuery schema, extracts the column names and comments, and automatically generates or updates the frontmatter and schema tables in bundle/tables/<table_name>.md.
- Project 2: Vector Search for OKF Bundles. Replace the simple substring-matching index in KnowledgeBundle.search() with a vector database. Write a script to generate text embeddings for each concept using an Ollama embedding model (like nomic-embed-text) and perform semantic retrieval instead of keyword search.
- Project 3: Git Hook Validator. Build a pre-commit Git hook that validates OKF bundles. The hook should check that every markdown file contains valid YAML frontmatter with the required type field, and that any links to other concepts ([text](../path/to/concept.md)) point to files that actually exist in the bundle.
Optional Practice Problems
Here are some optional practice problems to help you master the concepts covered in this chapter:
1. Warm-Up: Robust Frontmatter Parser (Easy)
The custom parser _parse_simple_yaml in okf_explorer.py is lightweight and requires no external libraries, but it cannot handle nested YAML dictionaries or multiline values.
- Task: Enhance the parser in one of the following ways:
  - Integrate a real YAML parser (the standard library's tomllib only reads TOML, so this means a third-party library such as PyYAML) or implement a more robust parser using regular expressions to support multi-line values (like descriptions spanning multiple lines).
  - Implement automatic type coercion. For example, if a key is timestamp, parse the value into a Python datetime object. If a key is sla, strip whitespace and standardize its casing.
  - Write a small test suite in Python to verify your parser correctly handles edge cases, such as values that contain colons (e.g. sla: available by 03:00 UTC).
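As a starting point for the type-coercion option, here is a minimal sketch that post-processes an already parsed frontmatter dictionary. The function name coerce_frontmatter is hypothetical, and lowercasing the sla value is just one possible casing convention, chosen here for illustration.

```python
from datetime import datetime, timezone


def coerce_frontmatter(raw: dict) -> dict:
    """Return a copy of raw with 'timestamp' and 'sla' normalized where possible."""
    coerced = dict(raw)

    ts = raw.get("timestamp")
    if isinstance(ts, str):
        # Rewrite a trailing "Z" as an explicit UTC offset so that
        # datetime.fromisoformat accepts it on older Python versions too.
        iso = ts[:-1] + "+00:00" if ts.endswith("Z") else ts
        try:
            coerced["timestamp"] = datetime.fromisoformat(iso)
        except ValueError:
            pass  # leave unparseable values as the original string

    sla = raw.get("sla")
    if isinstance(sla, str):
        # Collapse runs of whitespace and lowercase (an illustrative convention).
        coerced["sla"] = " ".join(sla.split()).lower()

    return coerced


if __name__ == "__main__":
    meta = coerce_frontmatter(
        {"timestamp": "2026-06-10T00:00:00Z", "sla": "  Available by 03:00 UTC  "}
    )
    print(meta["timestamp"] == datetime(2026, 6, 10, tzinfo=timezone.utc))
    print(meta["sla"])
```

Keeping coercion separate from parsing means the original string values remain available if a downstream tool needs them verbatim.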
2. OKF Link Integrity Checker (Medium)
In an OKF bundle, documentation files link to each other using relative markdown links (e.g., [sales_events](../tables/sales_events.md)). If a file name is changed or deleted, these references break.
- Task: Write a validation utility in okf_explorer.py that checks the link integrity of the entire bundle:
  - Extend KnowledgeBundle to parse the markdown body of each concept and locate all Markdown-style links: [link text](relative_path).
  - Resolve the relative_path relative to the current concept’s filesystem location.
  - Check if the file at the resolved path exists on disk.
  - Collect all broken links and output them as a structured report (showing the source file, line number, and broken path).
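The core of such a checker can be sketched as a standalone function before wiring it into KnowledgeBundle. Everything named here (find_broken_links, build_demo_bundle, the missing.md target) is hypothetical and chosen for illustration. The regular expression is deliberately simple: it does not handle links with titles or nested brackets, and it will also match links that appear inside code fences.

```python
import re
import tempfile
from pathlib import Path

# Matches [text](target) where target has no spaces or closing parenthesis.
LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")


def find_broken_links(bundle_root: Path) -> list[tuple[str, int, str]]:
    """Return (source file, line number, link target) for each unresolved relative link."""
    broken = []
    for md_path in sorted(bundle_root.rglob("*.md")):
        lines = md_path.read_text(encoding="utf-8").splitlines()
        for line_no, line in enumerate(lines, start=1):
            for _text, target in LINK_RE.findall(line):
                if "://" in target or target.startswith(("#", "mailto:")):
                    continue  # external URLs and in-page anchors are out of scope
                file_part = target.split("#", 1)[0]  # drop any #fragment
                if not (md_path.parent / file_part).resolve().exists():
                    source = md_path.relative_to(bundle_root).as_posix()
                    broken.append((source, line_no, target))
    return broken


def build_demo_bundle(root: Path) -> None:
    """Create a tiny throwaway bundle with one valid and one broken link."""
    (root / "metrics").mkdir(parents=True)
    (root / "tables").mkdir()
    (root / "tables" / "sales_events.md").write_text("# Sales Events\n", encoding="utf-8")
    (root / "metrics" / "daily_revenue.md").write_text(
        "# Daily Revenue\n\n"
        "* [tables/sales_events](../tables/sales_events.md)\n"
        "* [playbook](../playbooks/missing.md)\n",
        encoding="utf-8",
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_root = Path(tmp) / "bundle"
        build_demo_bundle(demo_root)
        for source, line_no, target in find_broken_links(demo_root):
            print(f"{source}:{line_no}: broken link -> {target}")
```

Returning tuples rather than printing directly makes it easy to reuse the same function in a pre-commit hook or a test suite.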
3. Graph-Augmented RAG Retrieval (Hard)
The standard OKFAgent._build_context retrieves the top-scoring concepts independently based on keywords. However, data concepts are highly relational: a metric like daily_revenue references tables/sales_events in its definition, which in turn points to playbooks/revenue_drop_investigation.
- Task: Enhance the retrieval agent to support graph-based context traversal:
  - Build a dependency graph where each node is a Concept and directed edges represent references (parsed from Markdown links or specified in frontmatter metadata).
  - When a user asks a question, retrieve the initial top_k concepts using keyword search.
  - Automatically traverse the graph to retrieve all directly connected neighbors (1-hop relation) of these concepts.
  - Combine the initial concepts and their neighbors, de-duplicate them, and construct the prompt context.
  - Test this implementation by asking: “What playbook should I run if the main metric depending on sales_events fails?” Verify that the agent successfully pulls the revenue_drop_investigation playbook even if the playbook itself doesn’t contain the keyword “sales_events”.
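The expansion step on its own can be sketched with plain dictionaries, independent of how the edges are extracted. The function name expand_one_hop is hypothetical. The edge set below contains only the two links visible in the daily_revenue sample document, and treating references as traversable in both directions is an assumption you may want to revisit.

```python
def expand_one_hop(seed_ids: list[str], edges: dict[str, set[str]]) -> list[str]:
    """Return the seeds followed by their direct neighbors, de-duplicated, order kept."""
    # Build an undirected view so a concept also "sees" the concepts that link to it.
    undirected: dict[str, set[str]] = {}
    for source, targets in edges.items():
        for target in targets:
            undirected.setdefault(source, set()).add(target)
            undirected.setdefault(target, set()).add(source)

    ordered = list(dict.fromkeys(seed_ids))  # de-duplicate seeds, preserving order
    seen = set(ordered)
    for seed in list(ordered):  # iterate over a copy: only the seeds are expanded
        for neighbor in sorted(undirected.get(seed, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                ordered.append(neighbor)
    return ordered


# Directed references taken from the links in the daily_revenue sample document.
EDGES = {
    "metrics/daily_revenue": {
        "tables/sales_events",
        "playbooks/revenue_drop_investigation",
    },
}

if __name__ == "__main__":
    print(expand_one_hop(["metrics/daily_revenue"], EDGES))
    print(expand_one_hop(["tables/sales_events"], EDGES))
```

With these edges, seeding with tables/sales_events reaches the metric but not the playbook, since the playbook is two hops away. That is worth keeping in mind when choosing between a 1-hop and a 2-hop traversal for the test question in the exercise.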