Butterfly Effect

Institutional data requirements

The technical data pack for a serious Butterfly Effect pilot.

This pack translates the current stack into a concrete data request: what is strictly required, what materially increases scientific value, and what file contract is sufficient to start a harmonized longitudinal pilot. The current execution lane is `raw nightly stress`, so the pack now includes the sidecar contract and the normalizer needed to absorb institutional HR/IBI epoch exports.

Minimum useful package

The current stack does not require full PSG on every subject. The minimum useful institutional package is longitudinal nightly sleep, nocturnal cardiovascular signal, movement or fragmentation, repeated outcomes, and basic day-level context.

  • Nightly sleep summaries
  • Nocturnal HR / HRV
  • Repeated outcomes
  • PSG subset optional but high-value

Core files: 5
Recommended minimum: 40 participants
Median exposure: 20 nights
Main unlock: Night transport

What richer data unlocks

  • Stronger cross-cohort nightly transport
  • Cleaner anxiety and depression portability studies
  • Physiological anchoring of interpretation claims
  • More defensible subgroup and calibration analyses

Minimum entities

The current stack can start from five core tables

Everything else improves fidelity. These five tables are the smallest clean institutional handoff that still supports route training, harmonization, and held-out evaluation.

participant

  • `participant_uid`
  • `source_dataset`
  • `source_subject_id`
  • recommended: age and sex metadata

night_sleep

  • night ID and local sleep date
  • sleep duration
  • continuity metric
  • nocturnal HR/HRV metric
  • movement or fragmentation metric

day_emotion / day_symptoms

  • local report date
  • stress and/or anxiety and/or depression
  • fatigue and/or pain when relevant
  • one row per participant-day report

day_confounders

  • caffeine
  • alcohol
  • exercise
  • naps
  • acute illness
  • medication change

Supplementary sidecars

  • IBI or beat-level intervals
  • epoch HR
  • epoch movement
  • respiration / SpO2
  • small PSG subset
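
To show how an IBI sidecar relates to the nocturnal HR/HRV metric in `night_sleep`, here is a minimal sketch that reduces one night of inter-beat intervals to mean heart rate and RMSSD. It is not the project's normalizer. The function name `summarize_ibi`, the choice of RMSSD as the HRV metric, and the plausibility bounds are all assumptions made for illustration.

```python
import math
from statistics import mean

def summarize_ibi(ibi_ms, min_ms=300.0, max_ms=2000.0):
    """Summarize one night of inter-beat intervals (milliseconds).

    The plausibility bounds are illustrative assumptions, not project settings.
    """
    # Keep only intervals inside the assumed plausible range.
    clean = [x for x in ibi_ms if min_ms <= x <= max_ms]
    if len(clean) < 2:
        return None  # too few beats to compute successive differences

    mean_hr_bpm = 60000.0 / mean(clean)

    # RMSSD: root mean square of successive differences.
    # Simplification: differences are taken across any removed beats.
    diffs = [b - a for a, b in zip(clean, clean[1:])]
    rmssd_ms = math.sqrt(mean(d * d for d in diffs))

    return {"n_beats": len(clean), "mean_hr_bpm": mean_hr_bpm, "rmssd_ms": rmssd_ms}

# Made-up demo values; the 5000 ms interval is dropped as implausible.
summary = summarize_ibi([1000.0, 1100.0, 1000.0, 5000.0])
```

A real pipeline would likely handle artifacts more carefully, for example by not differencing across removed beats, which is one reason beat-level sidecars are worth requesting alongside summary metrics.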

File contract

The preferred first pilot handoff

The point is not to request every possible file. The point is to get a handoff that can be harmonized quickly and defended methodologically.

Core files

  • `participant.csv`
  • `night_sleep.csv`
  • `day_emotion.csv`
  • `day_symptoms.csv`
  • `day_confounders.csv`

Operational rules

  • stable pseudonymized IDs
  • local dates, or timestamps with documented timezone handling
  • one row per participant-night in `night_sleep`
  • device and unit documentation
  • source lineage preserved
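
The operational rules above can be checked mechanically at handoff. The sketch below tests two of them: one row per participant-night in `night_sleep`, and every night row pointing at a known `participant_uid`. The column name `sleep_date`, the function name `check_night_sleep`, and the demo rows are hypothetical; the real templates may use different headers.

```python
import csv
import io
from collections import Counter

# Assumed key for a participant-night; `sleep_date` is a hypothetical column name.
NIGHT_KEY = ("participant_uid", "sleep_date")

def check_night_sleep(night_csv_text, participant_csv_text):
    """Return a list of human-readable contract violations."""
    nights = list(csv.DictReader(io.StringIO(night_csv_text)))
    participants = list(csv.DictReader(io.StringIO(participant_csv_text)))
    known = {p["participant_uid"] for p in participants}

    problems = []

    # Rule: one row per participant-night.
    counts = Counter(tuple(row[k] for k in NIGHT_KEY) for row in nights)
    for key, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"duplicate night row: {key} x{n}")

    # Rule: stable IDs that resolve to the participant table.
    for uid in sorted({row["participant_uid"] for row in nights} - known):
        problems.append(f"unknown participant_uid: {uid}")

    return problems

# Made-up demo content, not real cohort data.
participant_csv = "participant_uid,source_dataset,source_subject_id\nP001,demo,s1\n"
night_csv = (
    "participant_uid,sleep_date,sleep_duration_min\n"
    "P001,2024-03-01,412\n"
    "P001,2024-03-01,405\n"
    "P002,2024-03-01,390\n"
)
problems = check_night_sleep(night_csv, participant_csv)
```

Returning a list of messages rather than raising on the first failure lets an institution see every violation in a single pass.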

Operational pack

The field-level shortlist and templates are ready to send

The data pack defines scope. The variable shortlist and header-only templates make the first institutional handoff concrete.

Operational next step

The pilot protocol and intake gate are now explicit

The data pack defines what to ask for. The pilot protocol defines what to do next, and the intake checklist defines what must pass before any cohort enters modeling.

Data intake + governance

  • intake checklist
  • linkage audit
  • schema and unit audit
  • governance handling gate
  • `training/normalize_raw_nightly_physio_epochs.py` for HR/IBI epoch normalization

Pilot thresholds

The practical bar for a useful first pilot

These are not hard exclusion rules. They are the minimum level at which the pilot starts to answer the scientific question instead of just proving ingestion works.

Recommended minimum

  • `>= 40` participants
  • median `>= 20` nights per participant
  • repeated outcomes on `>= 30%` of nights or `>= 3` times per week
  • stable night-to-next-day linkage
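
Night-to-next-day linkage can be audited with a simple join. This sketch assumes the local sleep date is the calendar date on which the night begins, so the matching report falls on the following local date. That convention, the function name `link_night_to_next_day`, and the demo values are assumptions; a cohort that dates nights by wake-up day would need an offset of zero.

```python
from datetime import date, timedelta

def link_night_to_next_day(nights, reports, offset_days=1):
    """Pair each night with the report on the following local date.

    nights:  list of (participant_uid, sleep_date)
    reports: dict keyed by (participant_uid, report_date)
    """
    linked, unlinked = [], []
    for uid, sleep_date in nights:
        target = sleep_date + timedelta(days=offset_days)
        report = reports.get((uid, target))
        if report is None:
            unlinked.append((uid, sleep_date))
        else:
            linked.append((uid, sleep_date, report))
    return linked, unlinked

# Made-up demo values.
nights = [("P001", date(2024, 3, 1)), ("P001", date(2024, 3, 2))]
reports = {("P001", date(2024, 3, 2)): {"stress": 4}}
linked, unlinked = link_night_to_next_day(nights, reports)
link_rate = len(linked) / len(nights)
```

The share of nights that link to an outcome is the figure to compare against the repeated-outcome threshold.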

Preferred

  • `>= 75` participants
  • median `>= 45` nights per participant
  • nightly or near-daily outcomes
  • at least one raw sidecar
  • a small PSG subset
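
The numeric thresholds above can be evaluated directly from per-participant counts. This is a minimal sketch: the function name `summarize_cohort` is hypothetical, outcome coverage is pooled across all nights (the text does not say whether coverage is pooled or per participant), and the `>= 3` times per week alternative and the non-numeric preferred criteria are not checked.

```python
from statistics import median

def summarize_cohort(nights, outcome_nights):
    """Compare a cohort with the pilot thresholds.

    nights:         dict participant_uid -> number of recorded nights
    outcome_nights: dict participant_uid -> nights with a linked outcome
    """
    n_participants = len(nights)
    median_nights = median(nights.values()) if nights else 0
    total_nights = sum(nights.values())
    # Assumption: coverage is pooled over all nights in the cohort.
    covered = sum(outcome_nights.get(uid, 0) for uid in nights)
    coverage = covered / total_nights if total_nights else 0.0

    return {
        "participants": n_participants,
        "median_nights": median_nights,
        "outcome_coverage": coverage,
        # Recommended minimum: >= 40 participants, median >= 20 nights, >= 30% coverage.
        "meets_recommended": n_participants >= 40 and median_nights >= 20 and coverage >= 0.30,
        # Preferred tier, size criteria only: >= 75 participants, median >= 45 nights.
        "meets_preferred_size": n_participants >= 75 and median_nights >= 45,
    }

# Made-up demo cohort: 40 participants with 20 to 24 nights each, 10 outcome nights each.
demo_nights = {f"P{i:03d}": 20 + i % 5 for i in range(40)}
demo_outcomes = {uid: 10 for uid in demo_nights}
result = summarize_cohort(demo_nights, demo_outcomes)
```

Because these are not hard exclusion rules, the summary is better used as a report on where a cohort stands than as an automatic gate.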