Architecture
This document explains how chisel parses code internally. It's intended for contributors and users who want to understand the implementation or extend chisel with new providers.
Component Overview
┌─────────────────────────────────────────────────────────┐
│                         Chunker                         │
│   ┌─────────────────────────────────────────────────┐   │
│   │         providers map[Language]Provider         │   │
│   └─────────────────────────────────────────────────┘   │
│                            │                            │
│       ┌────────────────────┼────────────────────┐       │
│       ▼                    ▼                    ▼       │
│  ┌────────┐           ┌──────────┐         ┌─────────┐  │
│  │   Go   │           │TypeScript│         │  Rust   │  │
│  │Provider│           │ Provider │         │Provider │  │
│  └────┬───┘           └────┬─────┘         └────┬────┘  │
│       │                    │                    │       │
│       ▼                    ▼                    ▼       │
│  ┌─────────┐         ┌───────────┐        ┌───────────┐ │
│  │go/parser│         │tree-sitter│        │tree-sitter│ │
│  │(stdlib) │         │   (cgo)   │        │   (cgo)   │ │
│  └─────────┘         └───────────┘        └───────────┘ │
└─────────────────────────────────────────────────────────┘
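The Chunker routes each request to the appropriate Provider based on language. Each provider uses a language-specific parsing strategy: go/parser from the standard library for Go, tree-sitter for TypeScript, Python, and Rust, and plain string scanning for Markdown. The diagram shows three of the five providers.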
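The routing described above can be sketched as a map lookup followed by delegation. This is a minimal illustration, not chisel's actual code: the Provider method set, NewChunker, Register, ErrUnsupported, and the toy wholeFile provider are all assumptions introduced for the example. Only the providers map[Language]Provider field comes from the diagram.

```go
package main

import (
	"errors"
	"fmt"
)

// Language, Chunk, and Provider are hypothetical stand-ins for chisel's types.
type Language string

type Chunk struct {
	Name    string
	Content string
}

type Provider interface {
	Chunk(content []byte) ([]Chunk, error)
}

// ErrUnsupported is returned when no provider is registered for a language.
var ErrUnsupported = errors.New("unsupported language")

// Chunker dispatches to a provider keyed by language.
type Chunker struct {
	providers map[Language]Provider
}

func NewChunker() *Chunker {
	return &Chunker{providers: make(map[Language]Provider)}
}

// Register associates a provider with a language, replacing any previous one.
func (c *Chunker) Register(lang Language, p Provider) {
	c.providers[lang] = p
}

// Chunk looks up the provider for lang and delegates to it.
func (c *Chunker) Chunk(lang Language, content []byte) ([]Chunk, error) {
	p, ok := c.providers[lang]
	if !ok {
		return nil, fmt.Errorf("%w: %q", ErrUnsupported, lang)
	}
	return p.Chunk(content)
}

// wholeFile is a toy provider that returns the entire input as one chunk.
type wholeFile struct{}

func (wholeFile) Chunk(content []byte) ([]Chunk, error) {
	return []Chunk{{Name: "file", Content: string(content)}}, nil
}

func main() {
	c := NewChunker()
	c.Register("text", wholeFile{})

	chunks, err := c.Chunk("text", []byte("hello"))
	fmt.Println(len(chunks), err)

	_, err = c.Chunk("cobol", nil)
	fmt.Println(errors.Is(err, ErrUnsupported))
}
```

Keeping the Chunker ignorant of how each provider parses is what would let a cgo-backed provider live in a separate module from the pure-Go ones.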
Parsing Strategies
Go Provider
The Go provider uses the standard library's go/parser package. This gives us:
- Zero external dependencies
- Mature, well-tested parser
- Access to Go's AST types
The extraction walks the AST looking for:
- Package documentation (*ast.File.Doc)
- Function declarations (*ast.FuncDecl)
- Type declarations (*ast.GenDecl with token.TYPE)
For methods, we extract the receiver type to build the context chain.
// Simplified extraction
for _, decl := range file.Decls {
    switch d := decl.(type) {
    case *ast.FuncDecl:
        // Extract function/method
    case *ast.GenDecl:
        if d.Tok == token.TYPE {
            // Extract type (struct, interface, alias)
        }
    }
}
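A runnable version of the loop above might look like the following sketch. It uses only go/parser, go/ast, and go/token from the standard library. The helper names receiverName and listDecls, the label format, and the sample source are assumptions made for illustration, and generic receivers are deliberately left unhandled.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// receiverName returns the receiver's type name for a method, or "" for a
// plain function. Pointer receivers (*T) are unwrapped to T.
func receiverName(fn *ast.FuncDecl) string {
	if fn.Recv == nil || len(fn.Recv.List) == 0 {
		return ""
	}
	expr := fn.Recv.List[0].Type
	if star, ok := expr.(*ast.StarExpr); ok {
		expr = star.X
	}
	if ident, ok := expr.(*ast.Ident); ok {
		return ident.Name
	}
	return "" // generic receivers such as T[K] are not handled in this sketch
}

// listDecls parses src and returns one label per function, method, and type.
func listDecls(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, decl := range file.Decls {
		switch d := decl.(type) {
		case *ast.FuncDecl:
			if recv := receiverName(d); recv != "" {
				out = append(out, "method "+recv+"."+d.Name.Name)
			} else {
				out = append(out, "func "+d.Name.Name)
			}
		case *ast.GenDecl:
			if d.Tok != token.TYPE {
				continue // skip import, const, and var declarations
			}
			for _, spec := range d.Specs {
				if ts, ok := spec.(*ast.TypeSpec); ok {
					out = append(out, "type "+ts.Name.Name)
				}
			}
		}
	}
	return out, nil
}

const sample = `package demo

type Store struct{}

func (s *Store) Get(id int) string { return "" }

func New() *Store { return &Store{} }
`

func main() {
	decls, err := listDecls(sample)
	if err != nil {
		panic(err)
	}
	for _, d := range decls {
		fmt.Println(d)
	}
}
```

The receiver lookup is the piece that feeds the context chain: a method on *Store is reported under Store, so pointer and value receivers end up in the same context.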
Tree-sitter Providers
TypeScript, Python, and Rust use tree-sitter via the go-tree-sitter bindings. Tree-sitter provides:
- Incremental parsing (not currently used, but available)
- Error recovery for malformed code
- A consistent node API across languages (node type names still differ per grammar)
Each provider walks the syntax tree recursively:
func walkNode(node *sitter.Node, content []byte, ctx []string, chunks *[]Chunk) {
    switch node.Type() {
    case "function_declaration":
        *chunks = append(*chunks, extractFunction(node, content, ctx))
    case "class_declaration":
        // Extract the class, then walk its children with updated context.
        // className is the class identifier read from the node (lookup elided).
        newCtx := append(ctx, "class "+className)
        // ChildCount returns uint32 in go-tree-sitter, so convert it for the loop.
        for i := 0; i < int(node.ChildCount()); i++ {
            walkNode(node.Child(i), content, newCtx, chunks)
        }
    }
    // Continue walking...
}
Markdown Provider
The Markdown provider uses simple string scanning—no AST, no dependencies. It splits on headers:
for _, line := range lines {
    if strings.HasPrefix(line, "#") {
        // Start new section
    }
}
Each header starts a new section chunk. A section's content accumulates until the next header of equal or higher level (the same number of # characters or fewer).
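As a rough illustration of header-based splitting, the sketch below starts a new section at every header and records its level. It is a flat split only: it does not model nested sections being folded into their parent, and it is stricter than a bare "#" prefix check because it requires one to six # characters followed by a space. The Section type and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Section is a hypothetical chunk type: a header plus the lines under it.
type Section struct {
	Level int // number of leading '#' characters
	Title string
	Body  string
}

// headerLevel returns the header level of line, or 0 if the line does not
// start with one to six '#' characters followed by a space.
func headerLevel(line string) int {
	n := 0
	for n < len(line) && line[n] == '#' {
		n++
	}
	if n == 0 || n > 6 || n >= len(line) || line[n] != ' ' {
		return 0
	}
	return n
}

// splitSections starts a new section at every header. Lines before the
// first header are dropped in this sketch.
func splitSections(src string) []Section {
	var sections []Section
	var body []string
	// flush stores the accumulated lines on the most recent section.
	flush := func() {
		if len(sections) > 0 {
			sections[len(sections)-1].Body = strings.TrimSpace(strings.Join(body, "\n"))
		}
		body = body[:0]
	}
	for _, line := range strings.Split(src, "\n") {
		if lvl := headerLevel(line); lvl > 0 {
			flush()
			sections = append(sections, Section{Level: lvl, Title: strings.TrimSpace(line[lvl:])})
			continue
		}
		body = append(body, line)
	}
	flush()
	return sections
}

func main() {
	doc := "# Intro\nhello\n\n## Details\nmore text\n#notaheader\n"
	for _, s := range splitSections(doc) {
		fmt.Printf("%d %q %q\n", s.Level, s.Title, s.Body)
	}
}
```

A production splitter would also need to ignore # lines inside fenced code blocks, which this sketch does not attempt.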
Context Propagation
Context flows from parent to child during tree walking. When we enter a class, we push its name onto the context stack. Methods inside inherit that context.
class UserService {        // ctx = []
    getUser() {}           // ctx = ["class UserService"]
    class Inner {          // ctx = ["class UserService"]
        helper() {}        // ctx = ["class UserService", "class Inner"]
    }
}
The context is copied (not shared) for each chunk to avoid mutation issues.
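The copy matters in Go because append may reuse a slice's backing array, so two chunks could otherwise end up sharing and overwriting each other's context. The sketch below reproduces the UserService example with a stand-in Node type (not the tree-sitter node) and hypothetical Chunk and walk definitions.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a stand-in for a syntax-tree node; it is not the tree-sitter type.
type Node struct {
	Kind     string // "class" or "method"
	Name     string
	Children []*Node
}

// Chunk records a declaration together with the context it was found in.
type Chunk struct {
	Name    string
	Context []string
}

// walk emits a chunk per node and extends the context when entering a class.
func walk(n *Node, ctx []string, chunks *[]Chunk) {
	// Copy ctx so that later appends cannot overwrite this chunk's view.
	own := append([]string(nil), ctx...)
	*chunks = append(*chunks, Chunk{Name: n.Name, Context: own})

	if n.Kind != "class" {
		return
	}
	// Build the child context in a fresh slice rather than appending to ctx,
	// which might share a backing array with a sibling's context.
	childCtx := make([]string, len(ctx), len(ctx)+1)
	copy(childCtx, ctx)
	childCtx = append(childCtx, "class "+n.Name)
	for _, c := range n.Children {
		walk(c, childCtx, chunks)
	}
}

func main() {
	tree := &Node{Kind: "class", Name: "UserService", Children: []*Node{
		{Kind: "method", Name: "getUser"},
		{Kind: "class", Name: "Inner", Children: []*Node{
			{Kind: "method", Name: "helper"},
		}},
	}}
	var chunks []Chunk
	walk(tree, nil, &chunks)
	for _, c := range chunks {
		fmt.Printf("%-12s [%s]\n", c.Name, strings.Join(c.Context, ", "))
	}
}
```

Each chunk owns its Context slice, so a caller can sort, truncate, or extend it without affecting any other chunk.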
Design Q&A
Why separate providers instead of one unified parser?
Dependencies. Tree-sitter requires cgo and pulls in C libraries. Users who only need Go support shouldn't pay that cost. The workspace structure (go.work) isolates dependencies at the module level.
Why not use tree-sitter for Go?
The stdlib parser is battle-tested, has zero dependencies, and runs ~10x faster (see benchmarks). Tree-sitter would add complexity without benefit for Go.
Why preserve the full source in Content?
Embeddings need context. A function signature alone ("func Add(a, b int) int") is less meaningful than the full function with its documentation and body. We let the embedding model decide what matters.
Why use Kind instead of language-specific types?
Consistency. A Python class and a Go struct serve similar purposes. Normalizing to KindClass lets downstream tools treat them uniformly. Language-specific details can be recovered from the source.
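One way to picture this normalization is a lookup from (language, construct) pairs to a shared Kind. The sketch below is illustrative only: KindClass is the one name taken from this document, while KindFunction, KindUnknown, the construct type, the table contents, and normalize are assumptions.

```go
package main

import "fmt"

// Kind is a language-neutral category. Only KindClass is named in this
// document; the other constants are assumptions made for illustration.
type Kind int

const (
	KindUnknown Kind = iota
	KindFunction
	KindClass
)

func (k Kind) String() string {
	switch k {
	case KindFunction:
		return "function"
	case KindClass:
		return "class"
	default:
		return "unknown"
	}
}

// construct identifies a declaration form in a specific language.
type construct struct {
	Language string
	Form     string
}

// kinds maps language-specific constructs onto shared kinds (hypothetical table).
var kinds = map[construct]Kind{
	{"go", "struct"}:        KindClass,
	{"python", "class"}:     KindClass,
	{"typescript", "class"}: KindClass,
	{"go", "func"}:          KindFunction,
	{"python", "def"}:       KindFunction,
}

// normalize returns the shared Kind for a construct. Missing entries yield
// KindUnknown because that is the zero value of Kind.
func normalize(language, form string) Kind {
	return kinds[construct{language, form}]
}

func main() {
	fmt.Println(normalize("go", "struct"), normalize("python", "class"))
	fmt.Println(normalize("rust", "macro"))
}
```

Downstream code can then filter on a single value such as KindClass without knowing which language produced the chunk.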
Performance
Benchmarks on a Ryzen 5 3600X (representative ~50-line files):
| Provider | Time | Memory | Allocations |
|---|---|---|---|
| Go | 32µs | 17KB | 402 |
| TypeScript | 313µs | 63KB | 579 |
| Python | 328µs | 63KB | 569 |
| Rust | 293µs | 61KB | 566 |
| Markdown | 4µs | 7KB | 45 |
The Go provider is roughly 10x faster than the tree-sitter providers (32µs versus 293-328µs per file) due to stdlib optimization. Markdown is the fastest at 4µs because it does no AST construction.
For large files, tree-sitter's incremental parsing could be leveraged (not currently implemented).
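Figures like the ones in the table (time, bytes, and allocations per operation) can be collected with the standard testing package. The sketch below is not chisel's benchmark suite: parseOnce stands in for a provider call and only runs go/parser on a tiny hypothetical source, so its numbers will not match the table.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"testing"
)

const src = `package demo

// Add returns the sum of a and b.
func Add(a, b int) int { return a + b }
`

// parseOnce stands in for a provider call; it only runs go/parser.
func parseOnce() error {
	_, err := parser.ParseFile(token.NewFileSet(), "demo.go", src, parser.ParseComments)
	return err
}

// measure runs parseOnce under the benchmark harness and returns the result.
func measure() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			if err := parseOnce(); err != nil {
				b.Fatal(err)
			}
		}
	})
}

func main() {
	r := measure()
	fmt.Printf("%d ns/op, %d B/op, %d allocs/op\n",
		r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp())
}
```

In a real repository the same loop would normally live in a BenchmarkXxx function and run via go test -bench with -benchmem, which reports the same three columns.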
Next Steps
- Providers Guide — Language-specific extraction details
- API Reference — Function signatures
- Testing Guide — Testing code that uses chisel