Architecture

This document explains how chisel parses code internally. It's intended for contributors and users who want to understand the implementation or extend chisel with new providers.

Component Overview

┌─────────────────────────────────────────────────────────┐
│                         Chunker                         │
│  ┌───────────────────────────────────────────────────┐  │
│  │          providers map[Language]Provider          │  │
│  └─────────────────────────┬─────────────────────────┘  │
│                            │                            │
│         ┌──────────────────┼──────────────────┐         │
│         ▼                  ▼                  ▼         │
│   ┌───────────┐      ┌───────────┐      ┌───────────┐   │
│   │    Go     │      │TypeScript │      │   Rust    │   │
│   │ Provider  │      │ Provider  │      │ Provider  │   │
│   └─────┬─────┘      └─────┬─────┘      └─────┬─────┘   │
│         │                  │                  │         │
│         ▼                  ▼                  ▼         │
│   ┌───────────┐      ┌───────────┐      ┌───────────┐   │
│   │ go/parser │      │tree-sitter│      │tree-sitter│   │
│   │ (stdlib)  │      │   (cgo)   │      │   (cgo)   │   │
│   └───────────┘      └───────────┘      └───────────┘   │
└─────────────────────────────────────────────────────────┘

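The Chunker routes each request to the Provider registered for its language. Each provider uses its own parsing strategy: the standard library's go/parser for Go, tree-sitter for TypeScript, Python, and Rust, and plain string scanning for Markdown. The diagram shows three of the five providers.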
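The routing described above can be sketched as a map lookup followed by delegation. This is a minimal illustration, not chisel's actual code: the Provider interface shape, the Register method, the Language constants, and wholeFileProvider are all assumptions made for the example.

```go
package main

import "fmt"

// Language identifies a source language. The string values are illustrative.
type Language string

const (
	LangGo       Language = "go"
	LangMarkdown Language = "markdown"
)

// Chunk is a hypothetical, trimmed-down result type.
type Chunk struct {
	Name    string
	Content string
}

// Provider is an assumed interface: one method that turns source into chunks.
type Provider interface {
	Chunk(src []byte) ([]Chunk, error)
}

// Chunker owns the language-to-provider map shown in the diagram.
type Chunker struct {
	providers map[Language]Provider
}

// NewChunker returns a Chunker with no providers registered.
func NewChunker() *Chunker {
	return &Chunker{providers: make(map[Language]Provider)}
}

// Register makes a provider available for one language.
func (c *Chunker) Register(lang Language, p Provider) {
	c.providers[lang] = p
}

// Chunk looks up the provider for lang and delegates to it.
func (c *Chunker) Chunk(lang Language, src []byte) ([]Chunk, error) {
	p, ok := c.providers[lang]
	if !ok {
		return nil, fmt.Errorf("no provider registered for language %q", lang)
	}
	return p.Chunk(src)
}

// wholeFileProvider is a stand-in provider that returns the input as one chunk.
type wholeFileProvider struct{}

func (wholeFileProvider) Chunk(src []byte) ([]Chunk, error) {
	return []Chunk{{Name: "file", Content: string(src)}}, nil
}
```

Because providers are looked up at call time, a build that registers only the Go provider never touches the tree-sitter code paths.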

Parsing Strategies

Go Provider

The Go provider uses the standard library's go/parser package. This gives us:

Zero external dependencies
Mature, well-tested parser
Access to Go's AST types

The extraction walks the AST looking for:

Package documentation (the Doc field of *ast.File)
Function and method declarations (*ast.FuncDecl)
Type declarations (*ast.GenDecl with Tok == token.TYPE)

For methods, we extract the receiver type to build the context chain.

// Simplified extraction; file is the *ast.File returned by parser.ParseFile
for _, decl := range file.Decls {
    switch d := decl.(type) {
    case *ast.FuncDecl:
        // Extract function/method; d.Recv is non-nil for methods
    case *ast.GenDecl:
        if d.Tok == token.TYPE {
            // Extract each type in d.Specs (struct, interface, alias)
        }
    }
}
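For readers who want to run the walk end to end, here is a self-contained sketch that uses only go/parser, go/ast, and go/token. The helper names receiverName and listDecls and the label format are invented for the example, and generic receivers are deliberately not handled.

```go
package main

import (
	"go/ast"
	"go/parser"
	"go/token"
)

// receiverName returns the receiver's type name, unwrapping a pointer if present.
// Generic receivers such as T[K] are not handled in this sketch.
func receiverName(expr ast.Expr) string {
	if star, ok := expr.(*ast.StarExpr); ok {
		expr = star.X
	}
	if id, ok := expr.(*ast.Ident); ok {
		return id.Name
	}
	return ""
}

// listDecls parses src and returns one label per top-level function, method, and type.
func listDecls(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, decl := range file.Decls {
		switch d := decl.(type) {
		case *ast.FuncDecl:
			if d.Recv != nil && len(d.Recv.List) > 0 {
				// Method: prefix with the receiver type, the start of a context chain.
				out = append(out, "method "+receiverName(d.Recv.List[0].Type)+"."+d.Name.Name)
			} else {
				out = append(out, "func "+d.Name.Name)
			}
		case *ast.GenDecl:
			if d.Tok != token.TYPE {
				continue // skip imports, constants, and variables
			}
			for _, spec := range d.Specs {
				if ts, ok := spec.(*ast.TypeSpec); ok {
					out = append(out, "type "+ts.Name.Name)
				}
			}
		}
	}
	return out, nil
}
```

Declarations come back in source order, so a file declaring a type, then a method on it, then a constructor yields labels in that same order.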

Tree-sitter Providers

TypeScript, Python, and Rust use tree-sitter via the go-tree-sitter bindings. Tree-sitter provides:

Incremental parsing (not currently used, but available)
Error recovery for malformed code
A consistent tree API across languages (node type names themselves are grammar-specific)

Each provider walks the syntax tree recursively:

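func walkNode(node *sitter.Node, content []byte, ctx []string, chunks *[]Chunk) {
    switch node.Type() {
    case "function_declaration":
        *chunks = append(*chunks, extractFunction(node, content, ctx))
    case "class_declaration":
        // Extract class, then walk children with updated context.
        // Assumes the grammar exposes the class name under a "name" field.
        className := node.ChildByFieldName("name").Content(content)
        // Copy ctx so sibling subtrees never share a backing array.
        newCtx := append(append([]string{}, ctx...), "class "+className)
        // ChildCount returns uint32 in go-tree-sitter, so convert before comparing with i.
        for i := 0; i < int(node.ChildCount()); i++ {
            walkNode(node.Child(i), content, newCtx, chunks)
        }
    }
    // Continue walking...
}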

Markdown Provider

The Markdown provider uses simple string scanning—no AST, no dependencies. It splits on headers:

for _, line := range lines {
    // A header line starts with one or more '#' characters
    if strings.HasPrefix(line, "#") {
        // Start new section
    }
}

Each header starts a new section chunk. Content accumulates until the next header of equal or higher level.
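A runnable sketch of header-based splitting follows. It takes the simplest reading of the rule above and closes a section at every header; nesting subsections under their parent would need a level comparison on top of this. The Section type and both function names are invented for the example. It also assumes a header is one to six '#' characters followed by a space, and it does not skip '#' lines inside fenced code blocks.

```go
package main

import "strings"

// Section is a hypothetical chunk type for this sketch.
type Section struct {
	Level int // number of leading '#' characters
	Title string
	Body  string
}

// headerLevel reports the header level of line, or 0 if it is not a header.
// It requires one to six '#' characters followed by a space or end of line.
func headerLevel(line string) int {
	n := 0
	for n < len(line) && line[n] == '#' {
		n++
	}
	if n == 0 || n > 6 {
		return 0
	}
	if n < len(line) && line[n] != ' ' {
		return 0
	}
	return n
}

// splitSections starts a new section at every header line.
// Text before the first header is dropped in this simplified version.
func splitSections(doc string) []Section {
	var sections []Section
	var body []string
	// flush stores the accumulated body on the most recent section.
	flush := func() {
		if len(sections) > 0 {
			sections[len(sections)-1].Body = strings.TrimSpace(strings.Join(body, "\n"))
		}
		body = body[:0]
	}
	for _, line := range strings.Split(doc, "\n") {
		if lvl := headerLevel(line); lvl > 0 {
			flush()
			sections = append(sections, Section{Level: lvl, Title: strings.TrimSpace(line[lvl:])})
			continue
		}
		body = append(body, line)
	}
	flush()
	return sections
}
```

Requiring a space after the '#' run keeps lines such as "#notaheader" in the body, which a bare prefix check would misread as headers.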

Context Propagation

Context flows from parent to child during tree walking. When we enter a class, we push its name onto the context stack. Methods inside inherit that context.

class UserService {           // ctx = []
    getUser() {}              // ctx = ["class UserService"]

    class Inner {             // ctx = ["class UserService"]
        helper() {}           // ctx = ["class UserService", "class Inner"]
    }
}

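Each chunk gets its own copy of the context slice rather than a shared one, so appending to the context deeper in the tree cannot change the context already stored on another chunk.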
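The propagation and copying rules can be illustrated without tree-sitter. In this sketch the node and chunk types are hypothetical stand-ins for a real syntax tree and for chisel's chunk type. What matters is the pattern: build the child context in a fresh slice, and copy again when storing it.

```go
package main

// node is a hypothetical stand-in for a syntax-tree node.
type node struct {
	kind     string // "class" or "method"
	name     string
	children []*node
}

// chunk records a method together with the context it was found in.
type chunk struct {
	name string
	ctx  []string
}

// walk collects method chunks, passing an extended context to children of a class.
func walk(n *node, ctx []string, out *[]chunk) {
	switch n.kind {
	case "method":
		// Copy so the stored context cannot be altered by later appends.
		saved := append([]string(nil), ctx...)
		*out = append(*out, chunk{name: n.name, ctx: saved})
	case "class":
		// Build the child context in a fresh slice instead of appending in place.
		childCtx := make([]string, len(ctx), len(ctx)+1)
		copy(childCtx, ctx)
		childCtx = append(childCtx, "class "+n.name)
		for _, c := range n.children {
			walk(c, childCtx, out)
		}
	}
}
```

Appending directly to a slice that has spare capacity can write into a backing array that a sibling's context still references, which is the mutation issue the copy avoids.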

Design Q&A

Why separate providers instead of one unified parser?

Dependencies. Tree-sitter requires cgo and pulls in C libraries. Users who only need Go support shouldn't pay that cost. The workspace structure (go.work) isolates dependencies at the module level.

Why not use tree-sitter for Go?

The stdlib parser is battle-tested, has zero dependencies, and runs ~10x faster (see the Performance section). Tree-sitter would add complexity without benefit for Go.

Why preserve the full source in Content?

Embeddings need context. A function signature alone ("func Add(a, b int) int") is less meaningful than the full function with its documentation and body. We let the embedding model decide what matters.

Why use Kind instead of language-specific types?

Consistency. A Python class and a Go struct serve similar purposes. Normalizing to KindClass lets downstream tools treat them uniformly. Language-specific details can be recovered from the source.
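As a rough sketch of what this normalization might look like, the snippet below maps language-specific construct names onto a shared Kind. Only KindClass is named in this document. The other constants, the key format, and the normalize function are assumptions made for illustration.

```go
package main

// Kind is a language-neutral category. Only KindClass comes from the text above;
// the other values and the mapping table are assumptions for illustration.
type Kind string

const (
	KindFunction Kind = "function"
	KindClass    Kind = "class"
	KindUnknown  Kind = "unknown"
)

// kindByConstruct maps "language:construct" keys to a shared Kind.
var kindByConstruct = map[string]Kind{
	"go:struct":                  KindClass,
	"go:func":                    KindFunction,
	"python:class_definition":    KindClass,
	"python:function_definition": KindFunction,
}

// normalize returns the shared Kind for a language-specific construct,
// or KindUnknown when the construct is not in the table.
func normalize(language, construct string) Kind {
	if k, ok := kindByConstruct[language+":"+construct]; ok {
		return k
	}
	return KindUnknown
}
```

A downstream tool can then filter on KindClass once instead of knowing every language's name for a class-like construct.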

Performance

Benchmarks on a Ryzen 5 3600X (representative ~50-line files):

Provider	Time	Memory	Allocations
Go	32µs	17KB	402
TypeScript	313µs	63KB	579
Python	328µs	63KB	569
Rust	293µs	61KB	566
Markdown	4µs	7KB	45

The Go provider is roughly 10x faster than the tree-sitter providers (32µs versus 293-328µs), which we attribute to the optimized stdlib parser. Markdown is the fastest because it builds no AST.
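Numbers of this kind can be collected with the standard testing package. The sketch below is not the benchmark suite that produced the table. It measures only go/parser on a tiny input chosen for the example, so its results will differ from the figures above.

```go
package main

import (
	"go/parser"
	"go/token"
	"testing"
)

// sample is a tiny stand-in input; the table above used ~50-line files.
const sample = `package demo

// Add returns the sum of a and b.
func Add(a, b int) int { return a + b }
`

// benchParse measures go/parser on sample, recording time and allocations per run.
func benchParse() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			fset := token.NewFileSet()
			if _, err := parser.ParseFile(fset, "sample.go", sample, parser.ParseComments); err != nil {
				b.Fatal(err)
			}
		}
	})
}
```

The returned result exposes NsPerOp, AllocedBytesPerOp, and AllocsPerOp, which correspond to the Time, Memory, and Allocations columns.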

For large files, tree-sitter's incremental parsing could be leveraged (not currently implemented).

Next Steps

Providers Guide — Language-specific extraction details
API Reference — Function signatures
Testing Guide — Testing code that uses chisel