Provenance is not Validation

A signature on a file tells you whose it is. It does not tell you whether what's inside is any good.

It's Tuesday, and out of the kindness of your heart and a lapse of good judgement, you've agreed to help your friend move. She hands you a sealed cardboard box with a "Jenna's Kitchen Stuff" label. The box has "Fragile" written in confident black marker, and you carefully carry it up five flights of stairs.

When she opens it an hour later, you see it contains a salad spinner, a cheese grater, and a roll of sticky tape. Being a stoic, you are surprised but still at peace. The label was correct in the sense that yes, the stuff inside was from Jenna's kitchen. The 'Fragile' label was useless in the sense that it told you nothing about what was actually inside and how careful you should be carrying it.

Let's use your experience as a launching point for some of the cool features in Validibot.

Most of the data-sharing world right now is, broadly speaking, a stack of confidently-labelled cardboard boxes. We have signed datasets, signed simulation models, signed software releases, signed git commits. What we mostly do not have is a portable, machine-checkable claim that the data inside the box was actually checked against the rules someone said it should follow...in this case, "make sure all items are truly fragile."

A digital signature gives you provenance: it answers "who sent it?" Schema validation answers "does it parse?" Neither of them answers the question a careful consumer downstream actually wants to ask: "did it pass the substantive checks the original author meant to apply?" That's the assertion-and-credential gap that Validibot intends to close, by validating data according to a specific workflow and then proving to others that the data was validated.

To fill this idea out, let's use an example from the world of biodiversity...well, as I understand it at least, with my only credentials being 'human with interest in biosphere.'

Standardized biodiversity data

Marine and biodiversity researchers around the world publish their observation records into shared repositories that anyone can pull from. There are sets of standards for this data, such as Darwin Core for the records themselves, EML for the dataset-level metadata, and ISO 19115 for the geographic metadata. These standards are mature, well-documented, and handle the format problem nicely.

They don't, however, handle the truth-claim problem. Darwin Core will happily tell you that "decimalLatitude" is a valid number. It will not tell you that the latitude in question falls inside the sampling polygon the researcher actually visited. EML will check that you've included a methods section. It won't check that the protocol you named is the protocol your IRB actually approved. This isn't a flaw in the standards; it's just a different, more advanced type of validation. But the difference matters, especially under FAIR [1] and CARE [2] principles, where data is expected to travel across institutional, national, and Indigenous-sovereignty boundaries and still be trustworthy at the other end.

[1] FAIR (Findable, Accessible, Interoperable, Reusable) is a set of data-management principles for scholarly research originally articulated in Wilkinson et al. (2016) and now adopted by most major research funders.

[2] CARE (Collective benefit, Authority to control, Responsibility, Ethics) is the parallel set of principles for Indigenous data governance, published by the Global Indigenous Data Alliance.

A worked example

The imaginary, but still highly credentialled, biologist Dr. Rob has just spent six months running an eDNA survey across four estuaries in partnership with two Indigenous-led monitoring groups. He's about to submit his dataset to OBIS, the Ocean Biodiversity Information System run by UNESCO, as his funder requires it.

What he has on disk is three spreadsheets from different field assistants with mildly inconsistent column names, a few coordinates with truncated precision, and a handful of taxon names that don't quite match the authoritative catalog of marine species names, WoRMS.

What Rob needs first is a clean Darwin Core occurrence table: the CSV that will become the core of his eventual Darwin Core archive. He also needs some way to prove to a researcher in 2032 that the occurrence data he submitted today was checked against the rules he said he'd check it against.

Luckily, the developer on Rob's team has set up an instance of Validibot, and Rob created a workflow to validate that data according to his requirements. This worked example deliberately focuses on the occurrence CSV. Rob's records are a table of typed rows, which is exactly what Validibot's Tabular Validator is built for. The workflow applies three layers to that one file, each doing a different job.

(Note: we are working on a Darwin Core archive validator that can unpack an entire archive and send its tabular and XML components to the appropriate validators in the same workflow. Until that exists, the credential in this example covers the occurrence CSV, not the archive as a whole.)

Layer 1: Structured column checks

Validibot is a no-code environment that provides a simple user interface to construct validation workflows. For the first step in his workflow, Rob selects TabularValidator to process incoming tabular data in CSV form. To configure this step, Rob either pastes a Frictionless Table Schema descriptor or, more likely on the first pass, uploads a sample CSV with correct formatting and lets the step settings page infer one for him.

For a Darwin Core occurrence table a tiny slice looks like this:

Frictionless descriptor (excerpt)
{
  "fields": [
    {"name": "occurrenceID",     "type": "string",  "constraints": {"required": true, "unique": true}},
    {"name": "eventDate",        "type": "date",    "constraints": {"required": true}},
    {"name": "decimalLatitude",  "type": "number",  "constraints": {"required": true, "minimum": -90,  "maximum": 90}},
    {"name": "decimalLongitude", "type": "number",  "constraints": {"required": true, "minimum": -180, "maximum": 180}},
    {"name": "basisOfRecord",    "type": "string",  "constraints": {"required": true, "enum": ["HumanObservation", "MachineObservation", "PreservedSpecimen", "MaterialSample"]}}
  ],
  "primaryKey": "occurrenceID"
}

Reads as: every row needs an occurrence ID, an ISO-8601 date, real-world coordinates, and one of the controlled values of basisOfRecord. The occurrence ID must be unique across the file.

These are checked natively against the parsed CSV file: locale-free, deterministic, and reported as one finding per failed check rather than one per failing row, so a column with a thousand bad cells produces one readable finding and a row count, not a thousand. The inconsistent column names across Rob's three spreadsheets get caught here, the truncated coordinates that fall outside ±90/±180 get caught here, the dates that aren't ISO 8601 get caught here. This lane is unglamorous and essential and largely a solved problem.
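To make the "one finding per failed check" idea concrete, here is a minimal Python sketch. It is not Validibot's engine: the check functions, the finding fields (check, count, sample_rows), and the sample rows are all assumptions made for illustration, and the rules are a hand-written subset of the descriptor excerpt above.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Hypothetical, hand-written subset of the descriptor excerpt; not Validibot's engine.
BASIS_OF_RECORD = {"HumanObservation", "MachineObservation", "PreservedSpecimen", "MaterialSample"}


def is_iso_date(value):
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def in_range(low, high):
    def check(value):
        try:
            return low <= float(value) <= high
        except (TypeError, ValueError):
            return False
    return check


CHECKS = {
    "eventDate": is_iso_date,
    "decimalLatitude": in_range(-90, 90),
    "decimalLongitude": in_range(-180, 180),
    "basisOfRecord": lambda value: value in BASIS_OF_RECORD,
}


def validate(csv_text):
    """Return one finding per failed check, each with a row count and sample rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in ["occurrenceID", *CHECKS] if c not in (reader.fieldnames or [])]
    if missing:
        return [{"check": "missing-columns", "columns": missing}]

    failed_rows = defaultdict(list)
    seen_ids = set()
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        occurrence_id = row["occurrenceID"]
        if not occurrence_id or occurrence_id in seen_ids:  # required + unique
            failed_rows["occurrenceID"].append(row_number)
        seen_ids.add(occurrence_id)
        for column, check in CHECKS.items():
            if not check(row[column]):
                failed_rows[column].append(row_number)

    # Aggregate: a thousand bad cells in one column is still a single finding.
    return [
        {"check": column, "count": len(rows), "sample_rows": rows[:3]}
        for column, rows in failed_rows.items()
    ]


SAMPLE = """occurrenceID,eventDate,decimalLatitude,decimalLongitude,basisOfRecord
a1,2024-05-01,48.4,-123.3,MaterialSample
a2,01/05/2024,48.5,-123.4,MaterialSample
a2,2024-05-03,148.5,-123.4,MaterialSample
"""

findings = validate(SAMPLE)
```

A file whose header doesn't match the expected column names stops at the missing-columns finding, which is roughly where Rob's inconsistently named spreadsheets would be caught.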

Layer 2: Row-stage assertions

Structured constraints can't ask whether a coordinate falls inside a polygon, whether a taxon identifier resolves against WoRMS, or whether a Local Contexts notice is attached when a particular site flag is set. Those are cross-field and conditional rules, and they live in the validator's second lane: CEL assertions evaluated per row through the row.* namespace.

The engine compiles each assertion once per run and walks the dataframe; a null or evaluation error is a failure with its own code rather than a silent pass, which is the property that makes the outcome worth signing later on.

For Rob's submission, the row-stage assertion set might look something like this:

1) Coordinates inside the survey bounding box
row.decimalLatitude  >= s.bbox_south && row.decimalLatitude  <= s.bbox_north &&
row.decimalLongitude >= s.bbox_west  && row.decimalLongitude <= s.bbox_east

"Every row's coordinates fall inside the survey area Rob actually visited." The four corners are supplied as workflow signals (s.bbox_*), so the same workflow can be reused across surveys without editing the rule. A true polygon check would need a custom CEL helper we haven't shipped yet; for now, a tight bounding box catches the same class of "this record isn't from this study" mistake.
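For a sense of what a true polygon check would compute, here is a standalone Python sketch of the classic ray-casting test. It is not a Validibot helper; the function name point_in_polygon and the L-shaped survey area are made up for illustration, and the coordinates are treated as planar, which is only reasonable for a small survey area away from the poles and the antimeridian.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test. `polygon` is a list of (lon, lat) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point towards the east.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


# Hypothetical L-shaped survey area; its bounding box is the full 2 x 2 square.
survey_area = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]

in_polygon = point_in_polygon(0.5, 0.5, survey_area)    # inside the L
bbox_only = point_in_polygon(1.5, 1.5, survey_area)     # inside the bbox, outside the L
```

The second point is the case a bounding box lets through: it sits inside the box's corners but in the notch the survey never covered.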

2) Every taxon ID is shaped like a WoRMS LSID
row.taxonID != "" && row.taxonID.startsWith("urn:lsid:marinespecies.org")

"No free-text species names sneaking through. Every row carries an identifier in the WoRMS LSID format." This is a structural guard against typos and free-text leakage, not a live registry call: Validibot doesn't reach out to WoRMS from inside an assertion today. A future custom helper could do the actual resolution; the format check is the useful first cut.

3) Local Contexts label present where required
!(row.locationID in s.partner_flagged_sites) || has(row.localContextsNotice)

"For rows from partner-flagged sites, a Local Contexts TK Notice must be attached." Read as logical implication: if the location is one of the partner-flagged sites (s.partner_flagged_sites is a list signal), then a notice must be present; otherwise the rule is trivially satisfied.

None of those checks belong in Darwin Core. They're not format rules. They're substantive claims about what the data actually represents, and Rob is the only one who can write them down, because he's the only one who knows what the dataset is supposed to mean. Each assertion produces at most one finding per outcome class (with a count and a handful of sample row numbers) so a million-row failure is still one readable line in the report.
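The three rules and the "null or error is a failure" property can be sketched in plain Python. This is an illustration only, not the CEL engine: the signal values, the sample rows, the taxon ID, and the finding fields are assumptions, and Python exceptions stand in for CEL evaluation errors.

```python
from collections import defaultdict

# Hypothetical workflow signals, standing in for the s.* namespace.
signals = {
    "bbox_south": 48.0, "bbox_north": 49.0,
    "bbox_west": -124.0, "bbox_east": -123.0,
    "partner_flagged_sites": ["EST-2"],
}


def inside_bbox(row, s):
    return (s["bbox_south"] <= row["decimalLatitude"] <= s["bbox_north"]
            and s["bbox_west"] <= row["decimalLongitude"] <= s["bbox_east"])


def worms_lsid(row, s):
    return row["taxonID"].startswith("urn:lsid:marinespecies.org")


def notice_where_required(row, s):
    # Implication: flagged site => notice present.
    return (row["locationID"] not in s["partner_flagged_sites"]
            or bool(row.get("localContextsNotice")))


ASSERTIONS = {
    "inside_bbox": inside_bbox,
    "worms_lsid": worms_lsid,
    "notice_where_required": notice_where_required,
}


def evaluate(rows, s):
    """Return at most one finding per (assertion, outcome class)."""
    buckets = defaultdict(list)
    for number, row in enumerate(rows, start=1):
        for name, rule in ASSERTIONS.items():
            try:
                outcome = "pass" if rule(row, s) else "fail"
            except (KeyError, TypeError, AttributeError):
                outcome = "error"  # null or missing input is never a silent pass
            if outcome != "pass":
                buckets[(name, outcome)].append(number)
    return [
        {"assertion": name, "outcome": outcome, "count": len(nums), "sample_rows": nums[:5]}
        for (name, outcome), nums in sorted(buckets.items())
    ]


rows = [
    # The taxon ID number is a placeholder, not a looked-up WoRMS record.
    {"decimalLatitude": 48.5, "decimalLongitude": -123.5,
     "taxonID": "urn:lsid:marinespecies.org:taxname:12345", "locationID": "EST-1"},
    {"decimalLatitude": None, "decimalLongitude": -123.5,
     "taxonID": "Salmo salar", "locationID": "EST-2"},
    {"decimalLatitude": 50.2, "decimalLongitude": -123.5,
     "taxonID": "urn:lsid:marinespecies.org:taxname:12345", "locationID": "EST-2",
     "localContextsNotice": "TK Notice"},
]

findings = evaluate(rows, signals)
```

Row 2's missing latitude lands in its own "error" bucket rather than passing or being lumped in with row 3's out-of-area coordinate, which is the distinction that keeps the eventual signed outcome honest.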

Layer 3: A signed credential

When all the assertions pass, Validibot can issue a W3C Verifiable Credential [3] to serve as proof of what was validated, by whom and when. This credential wraps up three hashes and a timestamp: the hash of the dataset file, the hash of the workflow definition, and the hash of the assertion outcomes. The data itself isn't signed. Only the fingerprints. So the credential is safe to share publicly even when the underlying records are sensitive.

[3] Read this Validibot blog post for more on VCs in Validibot.
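As a rough illustration of "three hashes and a timestamp", here is a Python sketch that fingerprints a dataset, a workflow definition, and a set of outcomes. The field names (datasetSha256 and so on), the choice of SHA-256, and the JSON canonicalisation are assumptions for the sketch, not Validibot's actual credential schema, and no signing happens here.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def canonical_json(obj) -> bytes:
    # Sorted keys and fixed separators, so equal objects always hash identically.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_claims(dataset_bytes: bytes, workflow_definition: dict, outcomes: dict) -> dict:
    """Collect fingerprints only; the records themselves never enter the claims."""
    return {
        "datasetSha256": sha256_hex(dataset_bytes),
        "workflowSha256": sha256_hex(canonical_json(workflow_definition)),
        "outcomesSha256": sha256_hex(canonical_json(outcomes)),
        "completedAt": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


# Hypothetical inputs standing in for Rob's CSV, workflow, and run results.
dataset = b"occurrenceID,eventDate\na1,2024-05-01\n"
workflow = {"steps": ["tabular-schema", "row-assertions"]}
outcomes = {"passed": 3, "failed": 0}

claims = build_claims(dataset, workflow, outcomes)
```

Changing a single byte of the CSV changes datasetSha256, so a credential issued for one file cannot be quietly reused for an edited copy.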

Rob keeps the credential alongside the exact occurrence CSV that Validibot checked, and can include both when he assembles his wider publication package. Anyone who receives that CSV can verify the credential against the public JWKS endpoint of the Validibot instance that signed it. (For a Validibot Cloud-hosted account, that would be app.validibot.com/.well-known/jwks.json.) An end-user gets a yes-or-no answer to the only question that actually matters: was this exact file checked against this exact ruleset, at the time stamped on the credential?
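On the receiving end, two of the verification steps need nothing beyond the standard library: picking the right public key out of a JWKS document, and checking that the file in hand matches the recorded fingerprint. The Python sketch below assumes the credential arrives as a compact JWS whose header carries a kid, which may not match how Validibot serialises its credentials. It deliberately stops short of the signature check, which should be done with a maintained JOSE library rather than by hand.

```python
import base64
import hashlib
import json


def b64url_decode(segment: str) -> bytes:
    # JWS segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jws_header(compact_jws: str) -> dict:
    """Decode the protected header (first segment) of a compact JWS."""
    return json.loads(b64url_decode(compact_jws.split(".")[0]))


def select_key(jwks: dict, kid: str) -> dict:
    """Find the JWK whose key ID matches the one named in the header."""
    for key in jwks.get("keys", []):
        if key.get("kid") == kid:
            return key
    raise LookupError(f"no key with kid {kid!r} in JWKS")


def file_matches(dataset_bytes: bytes, recorded_sha256: str) -> bool:
    """Recompute the file's SHA-256 and compare it to the credential's value."""
    return hashlib.sha256(dataset_bytes).hexdigest() == recorded_sha256


# Hypothetical token and key set; the payload and signature are placeholders.
header = b64url_encode(json.dumps({"alg": "ES256", "kid": "key-1"}).encode("utf-8"))
token = f"{header}.payload-placeholder.signature-placeholder"
jwks = {"keys": [{"kid": "key-0", "kty": "EC"}, {"kid": "key-1", "kty": "EC"}]}

key = select_key(jwks, jws_header(token)["kid"])
```

In a real check the order matters: verify the signature with the selected key first, and only then trust the hashes inside the payload enough to compare them against the file.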

The credential identifies the exact submitted file, the workflow definition used to validate it, and the resulting validation run through cryptographic hashes. It also records when the run completed, its result, and its finding counts. For most users the issuer is Validibot Cloud, which means the verifier's trust chain is "I trust Validibot's published signing key." Labs that want full sovereignty over the signing party can run self-hosted Validibot Pro instead, generate their own keypair, and publish a JWKS at their own domain. In that case, the credential format and verification flow are identical, only the issuer URL and the JWKS host change.

Why I think this matters

FAIR data principles ask us to make data Findable, Accessible, Interoperable, and Reusable. To that I would add an informal "V" (which I guess would make it FAIR-Vuhh) for making sure data is "Validated." And then CARE principles raise the stakes further: data has to comply with Indigenous authority, ethics, and benefit-sharing requirements.

Schema validation alone isn't going to solve that problem. Provenance signatures aren't either. What closes the gap is a portable claim that methodical, substantive rules were defined, the data was checked against those rules, and the result was digitally signed in a way a downstream user can then independently verify. That's a small idea with a lot of leverage, and the standards that let Validibot do it (W3C Verifiable Credentials, JWS, JWKS) are well established.

Ah yes. And one big benefit that hasn't been mentioned yet: Rob saves time! Rob can give his team accounts on Validibot, so they can validate their own data...repeatedly...and not contact him until the data passes all checks. Validibot has a web-based submission form, an API and even a command-line interface (CLI). You can even connect your Claude Desktop to it via MCP and get Claude to use it directly.

Validibot isn't biodiversity-specific, of course. The above example is the same shape we use today for building energy models and FMU simulations, and it would work just as well for pharmaceutical bioassays, climate model intercomparisons, or any other domain where data crosses trust boundaries and someone downstream needs to verify a claim. I just think biodiversity data is a particularly clear example.

Could you actually do this in Validibot today?

The CSV portion, mostly yes. The Tabular Validator can handle a Darwin Core occurrence CSV as tabular data, with schema-described columns and row.* CEL assertions. Tabular Validator and CEL Assertions are in the community edition, and signed credentials are in the Pro tier.

However, one caveat is scope: this is a CSV workflow, not end-to-end DwC-A validation. We are still building the archive-ingestion layer described above. The second caveat is we don't yet have a ready-made biodiversity validator pack: a curated Frictionless descriptor for the common Darwin Core terms, a custom CEL helper that resolves WoRMS LSIDs against the live registry, a point_in_polygon helper so the survey-area check can be tighter than a bounding box, and a worked example wiring it all together. Those are things I'd need to build, should someone say "yeah that would be helpful!"

So if you work in biodiversity informatics...or in any field where "did this dataset actually pass the checks the author said it should?" is a real question with no good answer...I'd love to hear from you. Get in touch. I'm genuinely curious whether the pattern I've built for simulations translates cleanly into your world.

Until then: don't trust the label on the box. It's not fragile, it's just a salad spinner. Check what's inside. And if you're going to vouch for what's inside, give the next person a way to verify it.
