Reclaiming Data Autonomy as a Peasant

TL;DR: We are developing the suite of tools required to safely share and manage access to transcripts produced by coding agent sessions. Others are developing similar tools, but ours are designed to be local-first, prioritizing the user's autonomy with explicit opt-ins at network boundaries, and to provide fine-grained access controls at low friction, with sane defaults and settings profiles for varying levels of complexity. This will be a fundamental component of software engineering infrastructure, enabling us to (1) incorporate coding agents into software engineering processes in a sustainable way, (2) create a data commons with the public interest in mind, and (3) allow the public to have autonomy over their data, outside the walled gardens run by corporate actors such as Anthropic or OpenAI.

What coding agents capture

Using coding agents like Claude Code or OpenAI Codex leaves traces on your local machine in the form of transcripts, which contain all the messages and tool calls from your coding sessions. This means that much of the software engineering lifecycle can be captured within a series of coding agent transcripts. How much is captured depends on your usage patterns and how reliant you are on these agentic tools. If you're involving agents every step of the way, your entire process of research, product ideation, requirements gathering, iterating, fixing, and maintaining code may all be in your transcripts. These traces hold your preference data: a record of what you built, valued, and accepted.

The cost of collaboration

The speedup you get from AI coding agents also fizzles once you introduce... one other human. Collaborating with others means reviewing your teammate's code, or a random contributor's pull request that modified 10K lines in the codebase. As stated by Simon Willison (and one month earlier by Kailash Nadh), "Writing code is cheap now", but it still takes substantial effort to engineer good, verified, and maintainable solutions for multiple operating systems and form factors, and the record of those engineering decisions sits in the coding agent transcripts. When I'm handed a large agent-generated PR, whether it turns out to be good or slop, I'd prefer to also see the person's engineering process. How were they actually thinking about the problem? Did they do their due diligence? Did they ask the right questions? Was this a one-prompt job?

Many developers haven't tried looking at their past sessions. For those who have, it is often an unpleasant challenge, and figuring out how to share them with others is worse. In a mountain of JSONL files, with coding sessions named only by their UUID, finding the exact set of transcripts that corresponds to a given commit or pull request may be impossible.
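To make the lookup problem concrete, here is a rough sketch of one heuristic: list the transcripts whose activity overlaps a time window before a commit. It is an illustration, not a description of any tool's actual format. It assumes each JSONL line is a JSON object that may carry an ISO 8601 "timestamp" field, and the function names, the transcript directory argument, and the six-hour window are all hypothetical. A time overlap only narrows the search; it cannot prove that a session produced a given commit.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path


def session_time_range(path):
    """Return (first, last) timestamps found in one JSONL transcript, or None.

    Assumption: records are JSON objects with an ISO 8601 "timestamp" string.
    """
    stamps = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines rather than fail the whole scan
            raw = record.get("timestamp") if isinstance(record, dict) else None
            if not isinstance(raw, str):
                continue
            try:
                stamp = datetime.fromisoformat(raw.replace("Z", "+00:00"))
            except ValueError:
                continue
            if stamp.tzinfo is None:
                # Treat naive timestamps as UTC so comparisons never mix kinds.
                stamp = stamp.replace(tzinfo=timezone.utc)
            stamps.append(stamp)
    if not stamps:
        return None
    return min(stamps), max(stamps)


def sessions_near(transcript_dir, commit_time, window=timedelta(hours=6)):
    """List transcripts active between (commit_time - window) and commit_time.

    commit_time must be a timezone-aware datetime.
    """
    start = commit_time - window
    matches = []
    for path in sorted(Path(transcript_dir).rglob("*.jsonl")):
        span = session_time_range(path)
        if span and span[0] <= commit_time and span[1] >= start:
            matches.append(path)
    return matches
```

The commit time itself would come from your version control history. Anything stronger than this heuristic, such as recording the session identifier alongside the commit, needs tooling support at the moment the work happens.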

Who wins?

And who wins from this arrangement? As the person using these tools, you hopefully get a portion of the gains: you can iterate faster and produce orders of magnitude more code to build out your peasant dreams. The other big winner is Anthropic, OpenAI, or the Other AI Company you're using. By agreeing to their terms of use and privacy policies, you give them the power to plunder your trove of transcripts with impunity and use your efforts to grow their walled gardens of data. They can also change the terms on you at any time, or roll out changes that nullify your use case.

The gaps in existing tools

If you wanted to contribute your data back to an open data repository (similar to DataClaw) or store transcripts internally for your own team, most of the existing solutions don't come with fine-grained permission settings (or they put these settings behind a paywall). Sharing these transcripts also requires an understanding of the privacy implications and some data governance expertise, and this can get complicated very fast. Our solutions will give the user advanced permissions and access controls, while reducing the cognitive load of governance by guiding the user with secure and sane defaults and letting users progressively opt in to more complexity if they want.
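As a sketch of what "secure defaults with progressive opt-in" could look like in practice, consider the following. Every name here (SharingPolicy, Audience, the profile names, and the individual settings) is hypothetical and chosen for illustration; it is not the design of any existing tool. The idea is that the zero-configuration policy shares nothing over the network, and each wider audience is an explicit choice.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Audience(Enum):
    PRIVATE = "private"  # stays on the local machine
    TEAM = "team"
    PUBLIC = "public"


# Ordered from narrowest to widest audience.
_ORDER = [Audience.PRIVATE, Audience.TEAM, Audience.PUBLIC]


@dataclass(frozen=True)
class SharingPolicy:
    # Defaults are the most restrictive choice for every setting.
    audience: Audience = Audience.PRIVATE
    allow_network_upload: bool = False
    redact_secrets: bool = True
    include_tool_output: bool = False


# Hypothetical settings profiles for increasing levels of complexity.
PROFILES = {
    "default": SharingPolicy(),
    "team": SharingPolicy(audience=Audience.TEAM, allow_network_upload=True),
    "commons": SharingPolicy(audience=Audience.PUBLIC, allow_network_upload=True),
}


def can_share(policy, target):
    """Return True if the policy permits sharing with the target audience."""
    if target is Audience.PRIVATE:
        return True  # local use never crosses a network boundary
    if not policy.allow_network_upload:
        return False  # crossing the network boundary needs an explicit opt-in
    return _ORDER.index(target) <= _ORDER.index(policy.audience)


# Opting in to one extra setting leaves every other default in place.
custom = replace(PROFILES["team"], include_tool_output=True)
```

A user who never touches the settings gets the "default" profile, and moving to "team" or "commons" is a single deliberate step rather than a page of toggles.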

These topics have also been explored by Nicholas Vincent in his Substack post "The Coding Agent Data Deal", which examines the implications of the terms of use in popular coding agent tools, and in his follow-up post "The Paradox of Reuse in 2026: A Case of Quasi-Enclosure", which frames the relationship between AI providers, knowledge workers, and AI users as a precarious quasi-enclosure in which the AI providers can decide the terms at any time. While the arrangement benefits individual developers now, future terms may be less favorable, and some popular open-source libraries are already facing harm (tldraw closing its doors and limiting who can open PRs, or cURL closing its bug bounty program).

The case for a data commons

While policy-makers are figuring out regulations around AI and agent traces, we need faster-moving data commons in the interim, run by the community it serves. As it stands right now:

- The public has no input, agency, or autonomy over their data or the platforms that host it, and no infrastructure for attribution or attestation.
- As peasants, we rely on the government to enact and enforce policies in our interest, but government is:
  - slow
  - complex, maze-like, and confusing
  - in some cases politically misaligned
  - very high-effort to engage with
- This results in a collective action problem, which needs:
  - some amount of coordination and centralization
  - outreach and alliance-building

What we're building

This is why we are developing the suite of tools required to safely share and manage access to transcripts from coding agent sessions. This will be a fundamental component of development infrastructure that (1) lets the developer community sustainably incorporate coding agents into their software engineering processes, (2) creates a data commons with the public interest in mind, and (3) allows the public to have autonomy over their data, outside the walled gardens run by corporate actors such as Anthropic or OpenAI.