Butterfly Effect

Data intake + governance checklist

The operational gate before any institutional cohort enters modeling.

This checklist turns governance and intake into an explicit workflow: before transfer, after receipt, during linkage audit, during schema audit, and before any dataset is allowed into the analytical stack.

Core principle

The project does not need direct identifiers or full health-system integration to begin. It does need stable pseudonymized linkage, explicit date logic, documented units and device lineage, and a narrow scope for the first pilot.

Pseudonymized linkage · Explicit date logic · Units and lineage · No fabricated joins

Stages

Intake should be a gate, not an assumption

A dataset should not enter modeling merely because the transfer succeeded. It should enter only after linkage, units, scope, and governance have each passed an explicit review.

Before transfer

  • named data owner
  • cohort description
  • transfer path
  • data use agreement (DUA) or approval path

File receipt

  • manifest match
  • delivery integrity
  • source system logged
  • raw delivery preserved read-only
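
The manifest-match and delivery-integrity checks above can be sketched as a checksum comparison. This is a minimal illustration, assuming the data owner supplies a manifest of file names and SHA-256 digests; the function names `sha256_of` and `check_delivery`, the manifest shape, and the file names are hypothetical, not part of any agreed transfer format.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large deliveries need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_delivery(delivery_dir: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare received files against a manifest of {relative_name: sha256}."""
    received = {
        p.relative_to(delivery_dir).as_posix()
        for p in delivery_dir.rglob("*")
        if p.is_file()
    }
    expected = set(manifest)
    return {
        "missing": sorted(expected - received),
        "unexpected": sorted(received - expected),
        "checksum_mismatch": sorted(
            name
            for name in expected & received
            if sha256_of(delivery_dir / name) != manifest[name]
        ),
    }


# Demo on a throwaway directory; file names and contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nights.csv").write_bytes(b"participant_uid,sleep_date\n")
    (root / "extra.csv").write_bytes(b"not in manifest\n")
    manifest = {
        "nights.csv": sha256_of(root / "nights.csv"),
        "reports.csv": "0" * 64,  # listed in the manifest but never delivered
    }
    report = check_delivery(root, manifest)
```

Any non-empty list in `report` would be a reason to pause before the linkage audit. The sketch does not cover logging the source system or making the raw delivery read-only; those remain separate steps.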

Identity and linkage

  • stable `participant_uid`
  • explicit participant-night linkage
  • no fabricated joins
  • date logic documented
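
One way to make "no fabricated joins" checkable is to audit the join keys before any merge happens. The sketch below assumes two tables held as lists of dicts, keyed by `participant_uid` plus a `sleep_date` field; `sleep_date`, `audit_linkage`, and the sample rows are hypothetical names chosen for illustration.

```python
from collections import Counter


def audit_linkage(nights: list[dict], reports: list[dict]) -> dict[str, object]:
    """Audit key-based linkage between a nights table and a reports table."""

    def key(row: dict) -> tuple:
        return (row.get("participant_uid"), row.get("sleep_date"))

    def complete(row: dict) -> bool:
        # A key is usable only if both parts are present and non-empty.
        return all(key(row))

    night_counts = Counter(key(r) for r in nights if complete(r))
    return {
        # Rows that could only be joined by position, which is a hard stop.
        "incomplete_keys": sum(1 for r in nights + reports if not complete(r)),
        # The same participant-night appearing more than once.
        "duplicate_nights": sorted(k for k, n in night_counts.items() if n > 1),
        # Reports whose key matches no night row.
        "unmatched_reports": sorted(
            {key(r) for r in reports if complete(r) and key(r) not in night_counts}
        ),
    }


# Made-up rows for illustration only.
nights = [
    {"participant_uid": "p01", "sleep_date": "2024-01-01"},
    {"participant_uid": "p01", "sleep_date": "2024-01-02"},
    {"participant_uid": "p01", "sleep_date": "2024-01-02"},  # duplicate night
    {"participant_uid": "", "sleep_date": "2024-01-03"},     # missing uid
]
reports = [
    {"participant_uid": "p01", "sleep_date": "2024-01-01"},
    {"participant_uid": "p02", "sleep_date": "2024-01-01"},  # no matching night
]
linkage = audit_linkage(nights, reports)
```

The audit never looks at row position, so a table that passes it can be joined on keys alone. It does not check that `participant_uid` is stable across deliveries, which needs a comparison against earlier receipts.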

Schema and unit audit

  • field mapping complete
  • units normalized
  • device modality documented
  • score semantics documented
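
Unit normalization is easier to audit when every accepted source unit is listed explicitly and anything else fails loudly. The following is a sketch under that assumption; the field names (`skin_temp`, `sleep_duration`), the canonical units, and the `normalize` helper are all hypothetical and would be replaced by the project's own field mapping.

```python
# Canonical unit per field. These fields and units are illustrative assumptions.
CANONICAL_UNITS = {"skin_temp": "degC", "sleep_duration": "min"}

# Every accepted (field, source_unit) pair has a documented conversion.
CONVERTERS = {
    ("skin_temp", "degC"): lambda v: v,
    ("skin_temp", "degF"): lambda v: (v - 32.0) * 5.0 / 9.0,
    ("sleep_duration", "min"): lambda v: v,
    ("sleep_duration", "h"): lambda v: v * 60.0,
    ("sleep_duration", "s"): lambda v: v / 60.0,
}


def normalize(field: str, value: float, unit: str) -> tuple[float, str]:
    """Convert a value to the field's canonical unit, or fail loudly."""
    try:
        convert = CONVERTERS[(field, unit)]
    except KeyError:
        # An undocumented unit is treated as an audit failure, not guessed at.
        raise ValueError(f"no documented unit mapping for {field!r} in {unit!r}") from None
    return convert(value), CANONICAL_UNITS[field]


duration = normalize("sleep_duration", 7.5, "h")
temperature = normalize("skin_temp", 98.6, "degF")
```

Raising on an unknown unit mirrors the hard stop on missing unit definitions: the audit surfaces the gap instead of silently passing values through.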

Quality and feasibility

  • participant count
  • median nights
  • endpoint density
  • missingness summary
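
The four feasibility figures above can be computed from a participant-night table in a few lines. This sketch assumes one row per participant-night and defines endpoint density as the fraction of rows with a non-missing endpoint value; that definition, the `feasibility_summary` name, and the sample fields are assumptions, not the project's settled terms.

```python
from collections import Counter
from statistics import median


def feasibility_summary(rows: list[dict], endpoint_field: str, audited_fields: list[str]) -> dict:
    """Summarize cohort size, depth, endpoint coverage, and per-field missingness."""
    if not rows:
        raise ValueError("no participant-night rows to summarize")

    def missing_fraction(field: str) -> float:
        return sum(1 for r in rows if r.get(field) is None) / len(rows)

    nights_per_participant = Counter(r["participant_uid"] for r in rows)
    return {
        "participant_count": len(nights_per_participant),
        "median_nights": median(nights_per_participant.values()),
        "endpoint_density": 1.0 - missing_fraction(endpoint_field),
        "missingness": {f: missing_fraction(f) for f in audited_fields},
    }


# Made-up rows: p01 has three nights, p02 has one.
rows = [
    {"participant_uid": "p01", "sleep_date": "2024-01-01", "hrv": 41.0, "endpoint": 1},
    {"participant_uid": "p01", "sleep_date": "2024-01-02", "hrv": None, "endpoint": None},
    {"participant_uid": "p01", "sleep_date": "2024-01-03", "hrv": 39.5, "endpoint": 0},
    {"participant_uid": "p02", "sleep_date": "2024-01-01", "hrv": 52.0, "endpoint": None},
]
summary = feasibility_summary(rows, endpoint_field="endpoint", audited_fields=["hrv"])
```

Thresholds are deliberately left out: what counts as enough participants, nights, or endpoint coverage depends on the pilot's scope.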

Decision gate

  • `accept_for_pilot`
  • `accept_with_limits`
  • `hold_pending_clarification`
  • `reject_for_current_scope`
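
The four outcomes can be encoded so that every intake ends in exactly one recorded decision. The sketch below is one possible mapping, assuming the review produces three lists of findings; the precedence (hard stops, then open questions, then limits) and the `decide` helper are illustrative assumptions, not a documented rule.

```python
from enum import Enum


class Decision(str, Enum):
    ACCEPT_FOR_PILOT = "accept_for_pilot"
    ACCEPT_WITH_LIMITS = "accept_with_limits"
    HOLD_PENDING_CLARIFICATION = "hold_pending_clarification"
    REJECT_FOR_CURRENT_SCOPE = "reject_for_current_scope"


def decide(hard_stops: list[str], open_questions: list[str], limits: list[str]) -> Decision:
    """Map review findings to a single gate outcome (assumed precedence)."""
    if hard_stops:
        return Decision.REJECT_FOR_CURRENT_SCOPE
    if open_questions:
        return Decision.HOLD_PENDING_CLARIFICATION
    if limits:
        return Decision.ACCEPT_WITH_LIMITS
    return Decision.ACCEPT_FOR_PILOT


# Made-up findings for illustration.
clean = decide([], [], [])
limited = decide([], [], ["single device modality only"])
held = decide([], ["report-date timezone unclear"], ["single device modality only"])
rejected = decide(["join would depend on row order"], [], [])
```

Keeping the findings as lists of strings means the reasons behind a decision can be stored next to the decision itself.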

Red flags

These should stop the workflow

The point of a checklist is to block weak datasets early, not to rationalize them after transfer.

Hard stops

  • join would depend on row order
  • direct identifiers appear in analytical tables
  • no documented logic linking sleep date to report date
  • missing unit definitions for core physiology
  • only aggregate exports are available
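
Some hard stops can be screened automatically before a human review. As one example, the sketch below flags column names in an analytical table that look like direct identifiers. It is a name-based heuristic only; the token list, the pattern, and the `direct_identifier_columns` helper are assumptions, and it can both miss identifiers and flag harmless columns.

```python
import re

# Tokens that suggest a direct identifier when they appear as a whole
# underscore-delimited part of a column name. Illustrative, not exhaustive.
DIRECT_IDENTIFIER_PATTERN = re.compile(
    r"(^|_)(name|email|phone|ssn|mrn|address|dob)($|_)",
    re.IGNORECASE,
)


def direct_identifier_columns(columns: list[str]) -> list[str]:
    """Return column names that should be reviewed as possible direct identifiers."""
    return [c for c in columns if DIRECT_IDENTIFIER_PATTERN.search(c)]


# Made-up column names for illustration.
flagged = direct_identifier_columns(
    ["participant_uid", "sleep_date", "patient_name", "Email", "hrv_rmssd"]
)
```

A flagged column is a prompt for review, not proof of a violation, and an empty result does not show that the table is free of identifiers in its values.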

Required minimal governance posture

  • pseudonymized subject IDs
  • no direct identifiers in analytical tables
  • narrow documented research scope
  • file-level provenance
  • raw and transformed outputs separated