zoobzio January 18, 2026

Architecture

This document explains how chisel parses code internally. It's intended for contributors and users who want to understand the implementation or extend chisel with new providers.

Component Overview

┌─────────────────────────────────────────────────────────┐
│                         Chunker                         │
│  ┌───────────────────────────────────────────────────┐  │
│  │          providers map[Language]Provider          │  │
│  └───────────────────────────────────────────────────┘  │
│                            │                            │
│         ┌──────────────────┼──────────────────┐         │
│         ▼                  ▼                  ▼         │
│   ┌───────────┐      ┌───────────┐      ┌───────────┐   │
│   │    Go     │      │TypeScript │      │   Rust    │   │
│   │ Provider  │      │ Provider  │      │ Provider  │   │
│   └─────┬─────┘      └─────┬─────┘      └─────┬─────┘   │
│         │                  │                  │         │
│         ▼                  ▼                  ▼         │
│   ┌───────────┐      ┌───────────┐      ┌───────────┐   │
│   │ go/parser │      │tree-sitter│      │tree-sitter│   │
│   │ (stdlib)  │      │   (cgo)   │      │   (cgo)   │   │
│   └───────────┘      └───────────┘      └───────────┘   │
└─────────────────────────────────────────────────────────┘

The Chunker routes each request to the appropriate Provider based on language. Each provider uses a language-specific parsing strategy: the standard library for Go, tree-sitter for TypeScript, Python, and Rust, and plain string scanning for Markdown.

Parsing Strategies

Go Provider

The Go provider uses the standard library's go/parser package. This gives us:

  • Zero external dependencies
  • Mature, well-tested parser
  • Access to Go's AST types

The extraction walks the AST looking for:

  1. Package documentation (*ast.File.Doc)
  2. Function declarations (*ast.FuncDecl)
  3. Type declarations (*ast.GenDecl with token.TYPE)

For methods, we extract the receiver type to build the context chain.

// Simplified extraction
for _, decl := range file.Decls {
    switch d := decl.(type) {
    case *ast.FuncDecl:
        // Extract function/method
    case *ast.GenDecl:
        if d.Tok == token.TYPE {
            // Extract type (struct, interface, alias)
        }
    }
}
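For readers who want to run the loop above, here is a self-contained sketch built only on go/parser, go/ast, and go/token. The describe and receiverName helpers and the output format are hypothetical, chosen for illustration; chisel's own extraction produces chunks rather than strings, and generic receivers are not handled here.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// receiverName returns the receiver's type name for a method, or "" for a
// plain function. Pointer receivers (*T) are unwrapped to T.
func receiverName(fn *ast.FuncDecl) string {
	if fn.Recv == nil || len(fn.Recv.List) == 0 {
		return ""
	}
	expr := fn.Recv.List[0].Type
	if star, ok := expr.(*ast.StarExpr); ok {
		expr = star.X
	}
	if ident, ok := expr.(*ast.Ident); ok {
		return ident.Name
	}
	return ""
}

// describe parses src and returns one line per top-level function, method,
// or type declaration, in source order.
func describe(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, decl := range file.Decls {
		switch d := decl.(type) {
		case *ast.FuncDecl:
			if recv := receiverName(d); recv != "" {
				out = append(out, "method "+recv+"."+d.Name.Name)
			} else {
				out = append(out, "func "+d.Name.Name)
			}
		case *ast.GenDecl:
			if d.Tok != token.TYPE {
				continue // skip imports, constants, and variables
			}
			for _, spec := range d.Specs {
				if ts, ok := spec.(*ast.TypeSpec); ok {
					out = append(out, "type "+ts.Name.Name)
				}
			}
		}
	}
	return out, nil
}

func main() {
	src := `// Package demo is documented.
package demo

type User struct{ Name string }

func (u *User) Greet() string { return "hi " + u.Name }

func Add(a, b int) int { return a + b }
`
	lines, err := describe(src)
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

The receiver name is what lets a method chunk carry its owning type in the context chain.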

Tree-sitter Providers

TypeScript, Python, and Rust use tree-sitter via the go-tree-sitter bindings. Tree-sitter provides:

  • Incremental parsing (not currently used, but available)
  • Error recovery for malformed code
  • A consistent node API across languages (node type names still vary by grammar)

Each provider walks the syntax tree recursively:

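func walkNode(node *sitter.Node, content []byte, ctx []string, chunks *[]Chunk) {
    switch node.Type() {
    case "function_declaration":
        *chunks = append(*chunks, extractFunction(node, content, ctx))
    case "class_declaration":
        // Extract class, then walk children with updated context.
        // Assumes the grammar exposes the class name through a "name" field.
        className := ""
        if name := node.ChildByFieldName("name"); name != nil {
            className = name.Content(content)
        }
        // Copy ctx before appending so siblings never share a backing array.
        newCtx := append(append([]string(nil), ctx...), "class "+className)
        // ChildCount returns uint32 in go-tree-sitter, so convert it for the int index.
        for i := 0; i < int(node.ChildCount()); i++ {
            walkNode(node.Child(i), content, newCtx, chunks)
        }
    }
    // Continue walking...
}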

Markdown Provider

The Markdown provider uses simple string scanning—no AST, no dependencies. It splits on headers:

for i, line := range lines {
    if strings.HasPrefix(line, "#") {
        // Start new section
    }
}

Each header starts a new section chunk. Content accumulates until the next header of equal or higher level.
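The splitting rule can be sketched as follows. This sketch makes one interpretive assumption: "equal or higher level" is read as nesting, so a section's content includes its subsections and ends at the next header with the same number of '#' characters or fewer. The Section type, headerLevel, and splitSections are hypothetical names, not chisel's.

```go
package main

import (
	"fmt"
	"strings"
)

// Section is a hypothetical chunk shape for this sketch.
type Section struct {
	Title   string
	Level   int
	Content string
}

// headerLevel returns the number of leading '#' characters, or 0 if the
// line is not a header. Like a plain prefix check, it does not skip
// '#' lines inside fenced code blocks.
func headerLevel(line string) int {
	level := 0
	for level < len(line) && line[level] == '#' {
		level++
	}
	return level
}

// splitSections emits one Section per header line.
func splitSections(doc string) []Section {
	lines := strings.Split(doc, "\n")
	var sections []Section
	for i, line := range lines {
		level := headerLevel(line)
		if level == 0 {
			continue
		}
		// Accumulate until the next header of equal or higher level
		// (the same number of '#' characters or fewer).
		end := len(lines)
		for j := i + 1; j < len(lines); j++ {
			if l := headerLevel(lines[j]); l > 0 && l <= level {
				end = j
				break
			}
		}
		sections = append(sections, Section{
			Title:   strings.TrimSpace(line[level:]),
			Level:   level,
			Content: strings.Join(lines[i:end], "\n"),
		})
	}
	return sections
}

func main() {
	doc := "# Guide\nintro\n## Install\nsteps\n## Usage\nexamples\n# FAQ\nanswers"
	for _, s := range splitSections(doc) {
		fmt.Printf("level %d %q: %d lines\n", s.Level, s.Title, strings.Count(s.Content, "\n")+1)
	}
}
```

Under this reading the "Guide" section spans both of its subsections, while "Install" and "Usage" each stop at the next header of their own level.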

Context Propagation

Context flows from parent to child during tree walking. When we enter a class, we push its name onto the context stack. Methods inside inherit that context.

class UserService {           // ctx = []
    getUser() {}              // ctx = ["class UserService"]

    class Inner {             // ctx = ["class UserService"]
        helper() {}           // ctx = ["class UserService", "class Inner"]
    }
}

The context is copied (not shared) for each chunk to avoid mutation issues.

Design Q&A

Why separate providers instead of one unified parser?

Dependencies. Tree-sitter requires cgo and pulls in C libraries. Users who only need Go support shouldn't pay that cost. The workspace structure (go.work) isolates dependencies at the module level.

Why not use tree-sitter for Go?

The stdlib parser is battle-tested and has zero dependencies, and in the benchmarks under Performance the Go provider runs roughly 10x faster than the tree-sitter providers. Tree-sitter would add complexity without benefit for Go.

Why preserve the full source in Content?

Embeddings need context. A function signature alone ("func Add(a, b int) int") is less meaningful than the full function with its documentation and body. We let the embedding model decide what matters.

Why use Kind instead of language-specific types?

Consistency. A Python class and a Go struct serve similar purposes. Normalizing to KindClass lets downstream tools treat them uniformly. Language-specific details can be recovered from the source.
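One way to picture the normalization is a lookup table from language-specific constructs to a shared Kind. This is an illustrative sketch only: KindClass is the one value named above, while KindFunction, KindUnknown, the construct strings, and the normalize function are assumptions rather than chisel's actual definitions.

```go
package main

import "fmt"

// Kind is a language-neutral category. Only KindClass is named in the text;
// the other values are illustrative.
type Kind string

const (
	KindClass    Kind = "class"
	KindFunction Kind = "function"
	KindUnknown  Kind = "unknown"
)

// key pairs a language with the construct name its parser reports.
type key struct{ lang, construct string }

// kinds holds hypothetical mappings; a real provider would cover more cases.
var kinds = map[key]Kind{
	{"python", "class_definition"}:          KindClass,
	{"python", "function_definition"}:       KindFunction,
	{"typescript", "class_declaration"}:     KindClass,
	{"typescript", "function_declaration"}:  KindFunction,
	{"go", "struct"}:                        KindClass,
	{"go", "func"}:                          KindFunction,
}

// normalize maps a language-specific construct to its shared Kind.
func normalize(lang, construct string) Kind {
	if k, ok := kinds[key{lang, construct}]; ok {
		return k
	}
	return KindUnknown
}

func main() {
	fmt.Println(normalize("python", "class_definition"))
	fmt.Println(normalize("go", "struct"))
	fmt.Println(normalize("rust", "macro_definition"))
}
```

Downstream tools can then filter on one Kind value instead of knowing every grammar's node names.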

Performance

Benchmarks on a Ryzen 5 3600X (representative ~50-line files):

Provider     Time    Memory   Allocations
Go           32µs    17KB     402
TypeScript   313µs   63KB     579
Python       328µs   63KB     569
Rust         293µs   61KB     566
Markdown     4µs     7KB      45

The Go provider is roughly 9-10x faster than the tree-sitter providers (32µs versus 293-328µs), a gap attributed to the optimized stdlib parser. Markdown is fastest at 4µs because it constructs no AST.

For large files, tree-sitter's incremental parsing could be leveraged (not currently implemented).
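Measurements of this shape (time, bytes, and allocations per operation) can be collected with the standard testing package. The sketch below times only go/parser on a small hypothetical sample, not chisel's Go provider, so its numbers are not comparable to the table; benchParse and the sample source are assumptions for illustration.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"testing"
)

// sample is a tiny stand-in for the ~50-line files used in the benchmarks.
const sample = `package demo

type User struct{ Name string }

func (u *User) Greet() string { return "hi " + u.Name }

func Add(a, b int) int { return a + b }
`

// benchParse parses the sample b.N times and reports allocations.
func benchParse(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		fset := token.NewFileSet()
		if _, err := parser.ParseFile(fset, "sample.go", sample, parser.ParseComments); err != nil {
			b.Fatal(err)
		}
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside of "go test".
	res := testing.Benchmark(benchParse)
	fmt.Printf("%d ns/op, %d B/op, %d allocs/op\n",
		res.NsPerOp(), res.AllocedBytesPerOp(), res.AllocsPerOp())
}
```

In a real repository the same function would usually live in a _test.go file as BenchmarkParse and run with go test -bench and -benchmem.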

Next Steps