That last post. What can I say. Four thousand plus words? Did I really do that to you? It won't happen again. If I could do it over, I would[1]. When I set out to (very tentatively) write this series of posts on ark, I intended it as a way of showcasing something fun that I built that actually turned out to be the thing that I always wanted to build. I forgot to bring the fun to that last post and it reads (to me) like one big “look what I can do!” flail. And this piece! This piece started out in the same direction, only worse. It was mired in technical detail. At one point, in the first draft, I wrote,
This is naturally a more technical piece than most that I write, given the nature of what I am describing. I'll do my best to smooth out those rough edges, but know that I am aware this isn't for my usual audience.
Lazy, lazy, lazy. And not the right intention. Fortunately, I caught on to what I was doing when only a thousand words or so had been set down. Pauciloquy[2] is called for and there is still (barely) enough to recover. And so here we are, about to talk about ark's store, or, where'd all those files go? Where to begin…?
The Dreaded Org Chart
I've been unsatisfied with the hierarchical structure of file systems since I ran that very first catalog command on an Apple ][ e. I've lived with them ever since, an accumulating succession of decades that have done little to quench the burning dissatisfaction with the way files are stored on computers. It sometimes seems like a large part of my avocation in technology has been a desperate search for a way out of the rigid hierarchy[3]. “Tear down the wall!”
The search took me to Evernote with its notebooks and tags, and then to Obsidian with its org chart and tags. But Obsidian introduced, to me at least, the notion of a graph: that is, links between files, where each link forms an edge between two nodes. It is a powerful idea, more powerful than I realized. And so when I approached this hobby project and was considering the design, I had two strong ideas in mind:
- Get me off this org chart!
- How might I take advantage of graphs?
Simple Requirements
In my limited imagination, there are two poles on the file storage spectrum[4]: a simple listing of files and a graph of files where every file points to every other file. ark is designed to be as close to the simple listing of files as possible. My requirements were, therefore, simple:
- There should only ever be one of each item. Preventing duplicates makes things easier to find. I can't tell you how many times I have found three different copies of a Word document or photo on my computer.
- Items in the archive must be described separately from the files themselves. File systems provide the bare minimum capacity for describing a file. An archive is more than a file system so I need a way of describing those files to make finding them as easy as possible.
- The archive only stores finished products. Working documents and working files don't get into the archive until they are finished.
With these requirements in hand, I set about meeting each one. To ensure that each item in the archive is unique, it gets a unique file name based on its digital DNA. The set of bytes that make up a file can be “hashed” into a number that is, for all practical purposes, unique to that set of bytes. ark uses SHA-256 as its hashing mechanism. That number becomes not only the name of the file, but its identifier in ark's database. What it means in practice is that if I bring in an exact copy of a file that already exists in ark, it doesn't get added a second time; it is simply ignored in favor of the copy that is already in the system.
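To make the hashing idea concrete, here is a minimal sketch in Python. It is not ark's actual code: the function names `sha256_of` and `copy_in`, and the choice to name the stored file by its bare digest with no extension, are assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_in(source: Path, store: Path) -> tuple[str, bool]:
    """Copy `source` into the flat `store`, named by its hash.

    Returns (digest, added). `added` is False when a file with the same
    bytes is already in the store, in which case nothing is written.
    """
    store.mkdir(parents=True, exist_ok=True)
    digest = sha256_of(source)
    target = store / digest  # assumption: the stored name is the bare digest
    if target.exists():
        return digest, False
    shutil.copy2(source, target)  # the source file itself is never modified
    return digest, True
```

Calling `copy_in` twice on two byte-identical files leaves exactly one file in the store, which is the "only ever one of each item" requirement in miniature.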
To describe the files in the system, ark uses a SQLite database. This allows ark to have full-text search and semantic search capabilities without running a database server. The SQLite database is just another file on my computer. True, SQLite is not designed to be a multi-user database, but ark is not designed to be a multi-user application, so we're all good here. All of the meta-data needed to describe a file is stored in the database. That meta-data breaks down into five categories:
- Classification (doc type, series, sub-series, format, sensitivity, priority, etc.)
- Provenance (date authored, record origin, original source path, physical location, etc.)
- Identification (doc ID, sha256, title, store path)
- Quality (OCR status, OCR quality)
- Content (full-text search content, LLM summary, embeddings)
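A rough sketch of how such a catalog might look, using Python's built-in `sqlite3` module. The table and column names are my guesses based on the five categories above, not ark's real schema, and the full-text piece assumes a SQLite build that includes the FTS5 extension (most do).

```python
import sqlite3

# Hypothetical schema: one column or two per metadata category.
SCHEMA = """
CREATE TABLE documents (
    sha256        TEXT PRIMARY KEY,  -- identification
    title         TEXT,              -- identification
    doc_type      TEXT,              -- classification
    series        TEXT,              -- classification
    sensitivity   TEXT,              -- classification
    date_authored TEXT,              -- provenance
    source_path   TEXT,              -- provenance
    ocr_status    TEXT               -- quality
);
-- content: a full-text index keyed back to documents by sha256
CREATE VIRTUAL TABLE document_text USING fts5(sha256 UNINDEXED, content);
"""


def open_catalog(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the catalog; it is just a file, no server involved."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_document(conn, sha256, title, doc_type, content) -> bool:
    """Insert a document; return False if that hash is already catalogued."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO documents (sha256, title, doc_type) VALUES (?, ?, ?)",
        (sha256, title, doc_type),
    )
    if cur.rowcount == 0:
        return False
    conn.execute(
        "INSERT INTO document_text (sha256, content) VALUES (?, ?)",
        (sha256, content),
    )
    return True


def search(conn, query):
    """Return titles of documents whose text matches an FTS5 query."""
    rows = conn.execute(
        """
        SELECT documents.title
        FROM document_text
        JOIN documents ON documents.sha256 = document_text.sha256
        WHERE document_text MATCH ?
        ORDER BY rank
        """,
        (query,),
    )
    return [title for (title,) in rows]
```

The point of the sketch is the shape: the file's hash is the key, and everything I know about the file hangs off that key rather than living in a folder name.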
Finally, ark uses a “copy-in” strategy for files. That means that the source file is left untouched, wherever it comes from, and a copy of the file is brought into the archive. Because of this, and the other requirements I listed above, I can store all of the files in a flat structure within ark's store. After all, I never need to know the actual file name. I just need to be able to describe what I am looking for and the database takes care of the rest.
Connections
In your standard OS, files sit there on a file system completely unconnected. But in a personal archive, people are first-class citizens. So in addition to the database storing information about files, it stores information about the people in those files. And since it is the connections that make an archive like this come alive, ark supports three kinds of connections:
- Document-to-document: one email is linked to another as a “reply-to”. A receipt is linked to an invoice. Documents can link to one another.
- Document-to-person: Who wrote it, received it, who's mentioned, who's the subject. The same person may show up in lots of roles across documents. This is the single most-queried table in the archive. Anything person-aware reads from here.
- Person-to-person: Friends, family, colleagues, including date ranges. A handyman who works on your house may retire, and someone else takes over. The connections capture it all.
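The three kinds of connections can be pictured as three edge tables. What follows is a sketch under assumptions: the table names, column names, and the `documents_about` helper are all hypothetical, chosen only to show how a person-centered question can be answered without knowing where anything was filed.

```python
import sqlite3

# Hypothetical edge tables, one per kind of connection.
SCHEMA = """
CREATE TABLE documents (sha256 TEXT PRIMARY KEY, doc_type TEXT,
                        title TEXT, date_authored TEXT);
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
-- document-to-document: e.g. an email that is a reply to another
CREATE TABLE document_links (from_sha256 TEXT, to_sha256 TEXT, relation TEXT);
-- document-to-person: author, recipient, mentioned, subject
CREATE TABLE document_people (sha256 TEXT, person_id INTEGER, role TEXT);
-- person-to-person, with an optional date range
CREATE TABLE person_links (from_id INTEGER, to_id INTEGER, relation TEXT,
                           start_date TEXT, end_date TEXT);
"""


def documents_about(conn, name, since):
    """Every document connected to `name` on or after `since` (ISO date),
    whatever its type and whatever role the person plays in it."""
    rows = conn.execute(
        """
        SELECT DISTINCT d.date_authored, d.doc_type, d.title
        FROM people AS p
        JOIN document_people AS dp ON dp.person_id = p.id
        JOIN documents AS d ON d.sha256 = dp.sha256
        WHERE p.name = ? AND d.date_authored >= ?
        ORDER BY d.date_authored
        """,
        (name, since),
    )
    return rows.fetchall()
```

Notice that the query never mentions a folder, a series, or a document type. It walks from the person outward along the document-to-person edges, and letters, messages, and photos all come back together.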
Here's where things get fun! In an org chart, the answer to “show me everything about my son from the last two years” depends on remembering which folder you filed each thing in. In a graph, it doesn't. I asked Claude Desktop[5] that exact question and got back this:
Theater & Performance — The dominant thread. He'd been active in community theater for a while: directing plays, performing in musicals (the creepy Wolf in Into the Woods), and preparing for the role of Edwin Drood at his high school. He's a baritone/tenor in the choral group and was being recruited for an a cappella group. By May 2026, a choral concert I attended. Then the big news: he and a friend auditioned for Guys and Dolls — a six-hour audition — and he was offered the lead role of Nathan Detroit. Around 200 lines.
Prom — May 19, 2026. He went with a friend. Dinner first, then photos at the Jefferson Memorial via Uber Black, then the prom itself. I left the car at the venue so he could drive home. He got home around 10pm, changed, headed to a post-prom party, was home just after 1am. I noted he “seemed very grown up about the whole thing.”
Memorial Day 2026 — At our friends' house. He “christened the pool as he does every year.”
College Prep — From a family Zoom in late May: he's working on the Common Application and asking for recommendation letters. Summer plans still up in the air — SAT prep, camp counselor work, or a possible Europe trip.
Lawn — An October 2025 iMessage asks whether he can cut the lawn that weekend.
Rutgers campus tour — Mentioned in passing in the Zoom notes: I drove by my childhood New Jersey house “after a Rutgers campus tour with him.”
Cotillion — Per a January 2026 letter, he was working as an instructor — teaching students to dance and how to conduct themselves in society.
Driving — Also from that January letter, the dry observation that his having a license “does wonders for the nervous system.”
One question; every connected thing, pulled together by edges the archive had been quietly maintaining while I wasn't looking. Letters, iMessages, Zoom call notes, photos, calendar entries, diaries, social media: formats I don't normally think of as connected, returning one coherent picture of my kid over a two-year span[6]. The org chart could have stored these. Only the graph could have answered the question. (More on how that LLM call works under the hood, including bundles, edges, and redaction, next post.)
Two Ways to Organize
In looking at how archivists tend to organize archives, a 4-tiered, um, hierarchy (sorry!) emerged as a trend:
- Series (biographical, correspondence, writings, research, professional, financial, legal, medical, etc.).
- Sub-series (fixed categories that fall underneath each of the series).
- File. A collection of items in a series/sub-series in a physical archive.
- Item. The thing itself.
In ark we have hard mappings to three of the four: series, sub-series, and the item itself.
Series and sub-series are categories that form a controlled vocabulary. But I find it useful to have user-curated groupings as well. While ark can use tags, I created something called a “collection” which is a curated grouping named after the reason that the items are grouped together. For example #2026-tax-documents, or #2019-house-purchase, or #vacation-in-the-golden-age-notes. Documents can, of course, have a series and sub-series, be tagged, and be members of one or more collections. The nice thing about collections is that they can be used as input for other ark commands. For instance:
ark bundle '#vacation-in-the-golden-age-notes' | ark task summarize
which will create a bundle of all of the items in the #vacation-in-the-golden-age-notes collection and then use an LLM to summarize the entire bundle.
Bottom line: a collection is a list of items in the archive with context.
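For the curious, here is one way a `#collection` reference could be resolved into a list of items. This is a sketch, not ark's implementation: the post doesn't describe how `ark bundle` works internally, so the two tables, the `reason` and `position` columns, and the `resolve_collection` function are all assumptions.

```python
import sqlite3

# Hypothetical tables: a named collection plus an ordered membership list.
SCHEMA = """
CREATE TABLE collections (id INTEGER PRIMARY KEY, name TEXT UNIQUE, reason TEXT);
CREATE TABLE collection_items (collection_id INTEGER, sha256 TEXT, position INTEGER);
"""


def resolve_collection(conn, ref):
    """Turn a reference like '#2026-tax-documents' into an ordered list of
    item identifiers. An unknown collection yields an empty list."""
    name = ref[1:] if ref.startswith("#") else ref
    rows = conn.execute(
        """
        SELECT ci.sha256
        FROM collections AS c
        JOIN collection_items AS ci ON ci.collection_id = c.id
        WHERE c.name = ?
        ORDER BY ci.position
        """,
        (name,),
    )
    return [sha for (sha,) in rows]
```

Because membership lives in its own table, an item can belong to any number of collections without being copied or moved.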
Some Things Aren't Documents
Most things in ark are documents: emails, PDFs, photos, diary entries, Office documents, text files. But several things in the archive aren't documents; instead, they have their own dedicated database tables. These include reading events (to manage my reading list), health data from Apple Health and Fitbit, location data pulled off photos and extracted from other sources like diary entries, and more.
I'll write about each of these later on in this series. For the store, it is useful to know that these live alongside the document model and follow the same rules: they are addressable with a unique sha256 identifier, auditable, and integrated with ark's core command set (although they sometimes have commands of their own).
Two Things That Touch Everything
Two things in ark's data model don't sit in any one layer; they sit over all of them:
- Sensitivity: This is set on every item that comes into the archive. The archive treats sensitivity as a query filter, not a display hint. Items that are sensitive are automatically routed through different code paths. For example, if I use Claude Desktop to ask a question and the result includes sensitive data, the code path prevents “restricted” data from leaving the local machine so that it never gets to Claude Desktop. If the data is marked “sensitive”, any sensitive information is stripped and replaced with “[REDACTED]” before being sent off the local machine. This is true everywhere data might be exfiltrated off the local machine.
- Annotations: I have a lot to say about things (pauciloquy goes only so far). I engineered the annotations layer to sit atop everything in the archive. This way, I can add notes and comments to a document, a book record, a watch event, an Apple Health record, a person — anything in the archive can be annotated. Those annotations are searchable, and they are surfaced most commonly when looking at a document in the archive. This allows me to add context without touching the original item.
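The sensitivity rule described above boils down to a gate that every outbound result passes through. Here is a small sketch of that gate. The “restricted” and “sensitive” labels come from the description above; the “normal” label, the `Item` class, the `prepare_for_export` function, and especially the two regular expressions standing in for “sensitive information” are assumptions of mine, since how ark actually detects what to redact is a topic for a later post.

```python
import re
from dataclasses import dataclass

REDACTED = "[REDACTED]"

# Hypothetical patterns standing in for whatever counts as sensitive.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # looks like a US Social Security number
    re.compile(r"\b\d{16}\b"),             # looks like a 16-digit card number
]


@dataclass
class Item:
    title: str
    text: str
    sensitivity: str  # "normal" (assumed default), "sensitive", or "restricted"


def prepare_for_export(items):
    """Return copies of the items that are safe to send off the machine.

    Restricted items are dropped entirely; sensitive items are kept with
    matching spans replaced; everything else passes through unchanged.
    """
    outbound = []
    for item in items:
        if item.sensitivity == "restricted":
            continue  # never leaves the local machine
        text = item.text
        if item.sensitivity == "sensitive":
            for pattern in SENSITIVE_PATTERNS:
                text = pattern.sub(REDACTED, text)
        outbound.append(Item(item.title, text, item.sensitivity))
    return outbound
```

The function returns new `Item` objects rather than editing the originals, so what is stored in the archive is never altered by the act of answering a question.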
Conclusion
This design keeps ark entirely self-contained on my local machine. The sensitivity layer ensures that documents that shouldn't leave the machine don't. SQLite handles full-text and semantic search. And so far, this scales well. As of this writing, my store is 125 GB, not counting the SQLite database, which adds another 9 GB:
=== ark store stats ===
Strategy:      copy
Store path:    /Users/[username]/.local/share/ark/store
Store size:    125.0 GB (390,749 files on disk)
In store (DB): 365,774 document(s)
Index-only:    330,680 document(s) (no managed copy)
Total docs:    696,454
Type                   Total  In store  Index-only
──────────────────── ─────── ───────── ───────────
email                281,740   272,068       9,672
browser_visit        106,616         0     106,616
image                 85,415    77,182       8,233
imessage              41,267         0      41,267
tweet                 27,599         0      27,599
watch_event           26,149         0      26,149
calendar_event        21,564         0      21,564
facebook-post         19,234         0      19,234
cli_command           15,189         0      15,189
note                  12,914         7      12,907
health_day             9,411         0       9,411
pdf                    8,400     8,400           0
blog_comment           8,040         0       8,040
blog_post              7,477         0       7,477
purchase               5,813         0       5,813
music_play             5,194         0       5,194
attachment             4,302     4,302           0
git_commit             2,261         0       2,261
office                 2,078     2,078           0
reading_finished       1,547         0       1,547
book                   1,257         0       1,257
text                   1,114     1,114           0
diary_entry              804       612         192
reminder                 373         0         373
action_item              209         0         209
playlist                 124         0         124
video                     96         0          96
review                    83         0          83
weather_snapshot          52         0          52
blog_page                 38         0          38
subscription              27         0          27
message                   22         0          22
reading_started           19         0          19
code_file                 11        11           0
outbox_draft               8         0           8
conversation               3         0           3
day_summary                3         0           3
timeline_event             1         0           1
You'll notice the “Index-only” column has some big numbers. Some doc types — browser history, iMessages, tweets — live as index entries pointing at source databases or cloud accounts. The original bytes aren't worth duplicating, so ark keeps the metadata and content for search but doesn't manage a separate copy.
That's nearly 700,000 items in ark. Most of these items were ingested automatically into ark from a variety of sources. I'll talk about “ingestion at scale” next time.