The Store — Where the Archive Actually Lives – Jamie Todd Rubin



That last post. What can I say. Four thousand plus words? Did I really do that to you? It won't happen again. If I could do it over, I would. When I set out to (very tentatively) write this series of posts on ark, I intended it as a way of showcasing something fun that I built that actually turned out to be the thing that I always wanted to build. I forgot to bring the fun to that last post and it reads (to me) like one big “look what I can do!” flail. And this piece! This piece started out in the same direction, only worse. It was mired in technical detail. At one point, in the first draft, I wrote,

This is naturally a more technical piece than most that I write, given the nature of what I am describing. I'll do my best to smooth out those rough edges, but know that I am aware this isn't for my usual audience.

Lazy, lazy, lazy. And not the right intention. Fortunately, I caught on to what I was doing when only a thousand words or so had been set down. Pauciloquy is called for and there is still (barely) enough time to recover. And so here we are, about to talk about ark's store, or, where'd all those files go? Where to begin…?

The Dreaded Org Chart

I've been unsatisfied with the hierarchical nature of file systems since I ran that very first catalog command on an Apple ][ e. I've lived with them ever since, an accumulating succession of decades that have done little to quench the burning dissatisfaction with the way files are stored on computers. It sometimes seems like a large part of my avocation in technology has been a desperate search for a way out of the rigid hierarchy. “Tear down the wall!”

The search took me to Evernote with its notebooks and tags, and then to Obsidian with its org chart and tags. But Obsidian introduced, to me at least, the notion of a graph: that is, links between files that form edges between nodes. It is a powerful idea, more powerful than I realized. And so when I approached this hobby project and was considering the design, I had two ideas in mind:

  1. Get me off this org chart!
  2. How might I take advantage of graphs?

Simple Requirements

In my limited imagination, there are two poles on the file storage spectrum: a simple listing of files and a graph of files where every file points to every other file. ark is designed to be as close to the simple listing of files as possible. My requirements were, therefore, simple:

  1. There should only ever be one of each item. Preventing duplicates makes things easier to find. I can't tell you how many times I have found three different copies of a Word document or photo on my computer.
  2. Items in the archive must be described separately from the files themselves. File systems provide the bare minimum capacity for describing a file. An archive is more than a file system so I need a way of describing those files to make finding them as easy as possible.
  3. The archive only stores finished products. Working documents, working files don't get into the archive until they are finished.

With these requirements in hand, I set about meeting each one. To ensure that each item in the archive is unique, it gets a unique file name based on its DNA. The unique set of bytes that make up a file can be “hashed” into a number that is, for all practical purposes, unique to that set of bytes. ark uses sha256 for its hashing mechanism. That number becomes not only the name of the file, but its identifier in ark's database. What it means in practice is that if I bring an exact copy of a file into ark that already exists, it doesn't get added a second time; it is simply ignored in favor of the copy that is already in the system.
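To make the hashing idea concrete, here is a minimal sketch in Python using the standard hashlib module. It is not ark's actual code: the function name content_id and the chunk size are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def content_id(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's bytes.

    Hypothetical helper, not taken from ark itself.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so a large file never has to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical bytes get the same 64-character identifier no matter what they are named or where they live, which is what lets an archive recognize a duplicate on sight.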

To describe the files in the system, ark uses a SQLite database. This allows ark to have full-text search and semantic search capabilities without a database server. The SQLite database is just another file on my computer. True, SQLite is not designed to be a multi-user database, but ark is not designed to be a multi-user application, so we're all good here. All of the meta-data needed to describe a file is stored in the database. That meta-data breaks down into five categories:

  1. Classification (doc type, series, sub-series, format, sensitivity, priority, etc.)
  2. Provenance (date authored, record origin, original source path, physical location, etc.)
  3. Identification (doc ID, sha256, title, store path)
  4. Quality (OCR status, OCR quality)
  5. Content (full-text search content, LLM summary, embeddings)
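The five categories above could be sketched as a single SQLite table, shown here with Python's built-in sqlite3 module. Every table and column name is an assumption, the sha256 value is a shortened placeholder, and the real schema is surely richer (full-text and semantic search are left out entirely).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory here; the real database is a file
conn.execute(
    """
    CREATE TABLE documents (
        doc_id        INTEGER PRIMARY KEY,  -- identification
        sha256        TEXT NOT NULL UNIQUE,
        title         TEXT,
        store_path    TEXT,
        doc_type      TEXT,                 -- classification
        series        TEXT,
        sensitivity   TEXT,
        date_authored TEXT,                 -- provenance
        source_path   TEXT,
        ocr_status    TEXT,                 -- quality
        content       TEXT,                 -- content
        summary       TEXT
    )
    """
)

# Illustrative row only; none of these values come from a real archive.
row = ("ab12", "Water bill", "store/ab12", "pdf", "financial",
       "normal", "2025-03-01", "/tmp/bill.pdf", "done", "amount due", None)
insert = """
    INSERT OR IGNORE INTO documents
        (sha256, title, store_path, doc_type, series, sensitivity,
         date_authored, source_path, ocr_status, content, summary)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
"""
conn.execute(insert, row)
conn.execute(insert, row)  # same sha256 again: ignored by the UNIQUE constraint

# Find a file by describing it rather than by knowing its name.
found = conn.execute(
    "SELECT title, store_path FROM documents WHERE series = ? AND doc_type = ?",
    ("financial", "pdf"),
).fetchall()
print(found)
```

The UNIQUE constraint on sha256 is one way the "only one of each item" requirement could be enforced at the database level.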

Finally, ark uses a “copy-in” strategy for files. That means that the source file is left untouched, wherever it comes from, and a copy of the file is brought into the archive. Because of this, and the other requirements I listed above, I can store all of the files in a flat structure within ark's store. After all, I never need to know the actual file name. I just need to be able to describe what I am looking for and the database takes care of the rest.
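A copy-in step along these lines might look like the following sketch. The function name copy_in, the choice to name the stored file by its bare hash with no extension, and reading the whole file into memory are all assumptions made to keep the example short.

```python
import hashlib
import shutil
from pathlib import Path


def copy_in(source: Path, store: Path) -> tuple[str, bool]:
    """Copy `source` into a flat store and return (sha256, was_added).

    Hypothetical helper; reads the whole file at once for brevity.
    """
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    target = store / digest  # flat layout: no subfolders, the name is the hash
    if target.exists():
        return digest, False  # an exact copy is already archived; skip it
    shutil.copy2(source, target)  # copy only; the source file is left untouched
    return digest, True
```

Calling copy_in twice on the same file, or on a renamed duplicate, adds it once and reports False the second time, while the original stays wherever it was.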

Connections

In your standard OS, files sit there on a file system completely unconnected. But in a personal archive, people are first-class citizens. So in addition to the database storing information about files, it stores information about the people in those files. And since it is the connections that make an archive like this come alive, ark supports three kinds of connections:

  1. Document-to-document: one email is linked to another as a “reply-to”. A receipt is linked to an invoice. Documents can link to one another.
  2. Document-to-person: Who wrote it, received it, who's mentioned, who's the subject. The same person may show up in lots of roles across documents. This is the single most-queried table in the archive. Anything person-aware reads from here.
  3. Person-to-person: Friends, family, colleagues, including date ranges. A handyman who works on your house may retire, and someone else takes over. The connections capture it all.
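The three kinds of connections above can be sketched as three edge tables in SQLite. The table names, column names, and sample rows are all hypothetical, chosen only to show how a person-aware query can run without knowing where anything was filed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, title TEXT, date_authored TEXT);
    CREATE TABLE people (person_id INTEGER PRIMARY KEY, name TEXT);
    -- document-to-document, e.g. kind = 'reply-to'
    CREATE TABLE doc_links (from_doc INTEGER, to_doc INTEGER, kind TEXT);
    -- document-to-person, e.g. role = 'author', 'recipient', 'mentioned', 'subject'
    CREATE TABLE doc_people (doc_id INTEGER, person_id INTEGER, role TEXT);
    -- person-to-person, with a date range
    CREATE TABLE person_links (person_a INTEGER, person_b INTEGER, relation TEXT,
                               start_date TEXT, end_date TEXT);

    -- Made-up sample data for illustration only.
    INSERT INTO people VALUES (1, 'Sam');
    INSERT INTO documents VALUES
        (10, 'Old letter', '2019-06-01'),
        (11, 'Call notes', '2025-05-28'),
        (12, 'Text message', '2025-10-04');
    INSERT INTO doc_people VALUES
        (10, 1, 'subject'), (11, 1, 'mentioned'), (12, 1, 'recipient');
    """
)


def documents_about(conn, name, since):
    """Return (title, role) for every document linked to `name` since a date."""
    return conn.execute(
        """
        SELECT d.title, dp.role
        FROM doc_people AS dp
        JOIN documents AS d ON d.doc_id = dp.doc_id
        JOIN people AS p ON p.person_id = dp.person_id
        WHERE p.name = ? AND d.date_authored >= ?
        ORDER BY d.date_authored
        """,
        (name, since),
    ).fetchall()


print(documents_about(conn, "Sam", "2024-01-01"))
```

The query follows edges from a person to documents and never mentions a folder, a format, or a file name. Dates are stored as ISO strings so that plain string comparison sorts them correctly.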

Here's where things get fun! In an org chart, the answer to “show me everything about my son from the last two years” depends on remembering which folder you filed each thing in. In a graph, it doesn't. I asked Claude Desktop that exact question and got back this:

Theater & Performance — The dominant thread. He'd been active in community theater for a while: directing plays, performing in musicals (the creepy Wolf in Into the Woods), and preparing for the role of Edwin Drood at his high school. He's a baritone/tenor in the choral group and was being recruited for an a cappella group. By May 2026, a choral concert I attended. Then the big news: he and a friend auditioned for Guys and Dolls — a six-hour audition — and he was offered the lead role of Nathan Detroit. Around 200 lines.

Prom — May 19, 2026. He went with a friend. Dinner first, then photos at the Jefferson Memorial via Uber Black, then the prom itself. I left the car at the venue so he could drive home. He got home around 10pm, changed, headed to a post-prom party, was home just after 1am. I noted he “seemed very grown up about the whole thing.”

Memorial Day 2026 — At our friends' house. He “christened the pool as he does every year.”

College Prep — From a family Zoom in late May: he's working on the Common Application and asking for recommendation letters. Summer plans still up in the air — SAT prep, camp counselor work, or a possible Europe trip.

Lawn — An October 2025 iMessage asks whether he can cut the lawn that weekend.

Rutgers campus tour — Mentioned in passing in the Zoom notes: I drove by my childhood New Jersey house “after a Rutgers campus tour with him.”

Cotillion — Per a January 2026 letter, he was working as an instructor — teaching students to dance and how to conduct themselves in society.

Driving — Also from that January letter, the dry observation that his having a license “does wonders for the nervous system.”

One question; every connected thing — pulled together by edges the archive had been quietly maintaining while I wasn't looking. Letters, iMessages, Zoom call notes, photos, calendar entries, diaries, social media — formats I don't normally think of as connected, returning one coherent picture of my kid over a two-year span. The org chart could have stored these. Only the graph could have answered them. (More on how that LLM call works under the hood — bundles, edges, redaction — next post.)

Two Ways to Organize

In looking at how archivists tend to organize, a 4-tiered, um, hierarchy (sorry!) emerged as a trend:

  1. Series (biographical, correspondence, writings, research, professional, financial, legal, medical, etc.).
  2. Sub-series (fixed categories that fall underneath each of the series).
  3. File. A collection of items in a series/sub-series in a physical archive.
  4. Item. The thing itself.

In ark we have mappings to three of the four: series, sub-series, and the item itself.

Series and sub-series are categories that form a controlled vocabulary. But I find it useful to have user-curated groupings as well. While ark can use tags, I created something called a “collection”, which is a curated grouping named after the reason that the items are grouped together. For example, #2026-tax-documents, or #2019-house-purchase, or #vacation-in-the-golden-age-notes. Documents can, of course, have a series and sub-series, be tagged, and be members of one or more collections. The nice thing about collections is that they can be used as input for other ark commands. For instance:

ark bundle '#vacation-in-the-golden-age-notes' | ark task summarize

which will create a bundle of all of the items in the #vacation-in-the-golden-age-notes collection and then use an LLM to summarize the entire bundle.

Bottom line: a collection is a list of items in the archive with context.

Some Things Aren't Documents

Most things in ark are documents: emails, PDFs, photos, diary entries, Office documents, text files. But several things in the archive aren't documents; instead, they have their own dedicated database tables. These include reading events (to manage my reading list), health data from Apple Health and Fitbit, location data pulled off photos and extracted from other sources like diary entries, and more.

I'll write about each of these later on in this series. For the store, it is useful to know that these live alongside the document model and follow the same rules: they are addressable with a unique sha256 identifier, auditable, and integrated with ark's command set (although they sometimes have commands of their own).

Two Things That Touch Everything

Two things in ark's data model don't sit in any one layer; they sit over all of them:

  1. Sensitivity: This is set on every item that comes into the archive. The archive treats sensitivity as a query filter, not a display hint. Items that are sensitive are automatically routed through different code paths. For example, if I use Claude Desktop to ask a question and the result includes sensitive data, the code path prevents “restricted” data from leaving the local machine so that it never gets to Claude Desktop. If the data is marked “sensitive”, any sensitive information is stripped and replaced with “[REDACTED]” before being sent off the local machine. This is true everywhere data might be exfiltrated off the local machine.
  2. Annotations: I have a lot to say about things (pauciloquy goes only so far). I engineered the annotations layer to sit atop everything in the archive. This way, I can add notes and comments to a document, a book record, a watch event, an Apple Health record, a person — anything in the archive can be annotated. Those annotations are searchable, and they are surfaced most commonly when looking at a document in the archive. This allows me to add context without touching the original item.
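The sensitivity rule described above (restricted items never leave the machine, sensitive items leave only after redaction) can be sketched as a filter that runs before anything is sent out. The "normal" label, the Item class, and the pattern used to spot sensitive text are assumptions for illustration. How ark actually detects what to redact is not described here.

```python
import re
from dataclasses import dataclass


@dataclass
class Item:
    title: str
    text: str
    sensitivity: str  # "normal" (assumed label), "sensitive", or "restricted"


# Hypothetical rule: treat anything shaped like a US Social Security number
# as the sensitive part of a document.
SENSITIVE_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def prepare_for_export(items):
    """Return (title, text) pairs that are allowed to leave the local machine."""
    outgoing = []
    for item in items:
        if item.sensitivity == "restricted":
            continue  # restricted items are dropped and never leave the machine
        text = item.text
        if item.sensitivity == "sensitive":
            text = SENSITIVE_PATTERN.sub("[REDACTED]", text)
        outgoing.append((item.title, text))
    return outgoing


items = [
    Item("Grocery list", "milk, eggs", "normal"),
    Item("Tax form", "SSN 123-45-6789 on file", "sensitive"),
    Item("Medical record", "private details", "restricted"),
]
print(prepare_for_export(items))
```

Putting the check in one function that every outbound path must call is what makes sensitivity behave like a query filter and not a display hint: nothing downstream ever sees the unfiltered data.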

Conclusion

This design keeps ark entirely self-contained on my local machine. The sensitivity layer ensures that documents that shouldn't leave the machine don't. SQLite handles full-text and semantic search. And so far, this scales well. As of this writing, my store is 125 GB, not counting the SQLite database, which adds another 9 GB:

=== ark store stats ===
Strategy: copy
Store path: /Users/[username]/.local/share/ark/store
Store size: 125.0 GB (390,749 files on disk)
In store (DB): 365,774 document(s)
Index-only: 330,680 document(s) (no managed copy)
Total docs: 696,454

Type                    Total   In store   Index-only
────────────────────  ───────  ─────────  ───────────
email                 281,740    272,068        9,672
browser_visit         106,616          0      106,616
image                  85,415     77,182        8,233
imessage               41,267          0       41,267
tweet                  27,599          0       27,599
watch_event            26,149          0       26,149
calendar_event         21,564          0       21,564
facebook-post          19,234          0       19,234
cli_command            15,189          0       15,189
note                   12,914          7       12,907
health_day              9,411          0        9,411
pdf                     8,400      8,400            0
blog_comment            8,040          0        8,040
blog_post               7,477          0        7,477
purchase                5,813          0        5,813
music_play              5,194          0        5,194
attachment              4,302      4,302            0
git_commit              2,261          0        2,261
office                  2,078      2,078            0
reading_finished        1,547          0        1,547
book                    1,257          0        1,257
text                    1,114      1,114            0
diary_entry               804        612          192
reminder                  373          0          373
action_item               209          0          209
playlist                  124          0          124
video                      96          0           96
review                     83          0           83
weather_snapshot           52          0           52
blog_page                  38          0           38
subscription               27          0           27
message                    22          0           22
reading_started            19          0           19
code_file                  11         11            0
outbox_draft                8          0            8
conversation                3          0            3
day_summary                 3          0            3
timeline_event              1          0            1

You'll notice the “Index-only” column has some big numbers. Some doc types — browser history, iMessages, tweets — live as index entries pointing at source databases or cloud accounts. The original bytes aren't worth duplicating, so ark keeps the metadata and content for search but doesn't manage a separate copy.

That's nearly 700,000 items in ark. Most of these items were ingested automatically into ark from a variety of sources. I'll talk about “ingestion at scale” next time.


