ofhsyn
Synthetic Our Future Health Data Generator
Generates synthetic Our Future Health cohort datasets for method development, including participant, questionnaire, clinic measurements, outpatient, inpatient, emergency, mortality, primary care medication, and geography outputs. Supports reproducible generation with configurable cohort size and user-defined International Classification of Diseases, Tenth Revision (ICD-10), Office of Population Censuses and Surveys Classification of Interventions and Procedures, version 4 (OPCS-4), and British National Formulary (BNF) code pools.
README
# Questionnaire Modular Generator

This folder contains the modular questionnaire generation pipeline used by:

- `../generate_questionnaire_data.R`

The goal is to keep questionnaire generation split by questionnaire section while still producing one final `questionnaire_data.csv` in `../../data/`.

## High-level flow

1. `../generate_questionnaire_data.R` is the entrypoint (sourced by `00_run_all_generators.R` or run directly).
2. It sources `questionnaire/merge_sections.R`.
3. `merge_sections.R` resolves the generator root via `resolve_generator_root()`, then sources all six section scripts in order.
4. Each section script writes one CSV under `questionnaire/outputs/`.
5. `merge_sections.R` reads those CSVs, merges them by `pid`, optionally aligns the column schema against the reference output from `bootstrap_source_data.R`, and writes `../../data/questionnaire_data.csv`.

If `../../data/questionnaire_data.csv` does not exist on first run, every section script falls back to `get_questionnaire_data()` → `bootstrap_source_data.R` to obtain the base participant frame.

## Per-section generation pattern

Every section script follows the same pattern:

1. Source `section_utils.R` (handles running from either `generator_scripts/` or `questionnaire/`).
2. Call `get_questionnaire_data()` to load the base participant frame.
3. Declare a `cols` vector of the section's column names.
4. Call `ensure_columns(questionnaire_data, cols)` to guarantee every column exists (fills with `NA_character_` if absent).
5. Call `fill_defaults_for_columns(questionnaire_data, cols)` for columns that need realistic probability distributions applied up-front (e.g. height/weight units, housing type, smoking status).
6. Subset to `section_df <- questionnaire_data[, cols]` and apply hand-coded generation logic for routing, version splits, and columns not in the PDF catalog.
7. Call `apply_pdf_value_catalog(section_df, questionnaire_data, cols)` to fill any remaining NA/numeric-placeholder columns from the PDF-derived answer catalog.
8. Call `write_section(section_df, "<section_name>")` to persist the output CSV.

## PDF value catalog

`apply_pdf_value_catalog()` drives the answer-value generation for most columns:

- Reads `questionnaire_column_unique_values_from_pdfs.csv` (looked up in `questionnaire/`, `questionnaire_analysis/`, or `inst/generator_scripts/questionnaire/`; result is cached per session).
- For each column, the catalog records the question type (`categorical` or `numeric`), which questionnaire versions carry the question (`v1`, `v2`, or both), and the allowed answer strings per version.
- Participants are split into v1/v2 cohorts using `questionnaire_version`; columns absent in a version are set to `NA` for that cohort.
- **Categorical columns**: answers are sampled uniformly from the version-appropriate allowed-value list.
- **Numeric columns**: values are generated by column-name pattern (`height`, `weight`, `_hrs_`, `_mins_`, `_age_`, `_yrs_`, `_num_`, `_days_`, `immigrate_uk_yr`), using realistic distributions (normal for height/weight, uniform for durations, integer ranges for counts/years).

## Script responsibilities

- `questionnaire.R`: Base questionnaire identifiers/metadata.
  - Owns `id`, `pid`, `questionnaire_version`, `submission_date`.
- `you_and_your_household.R`: Household and demographic columns (height, weight, language, relationship/civil status, housing type, tenure, energy, transport).
- `work_and_education.R`: Work and education columns.
- `your_lifestyle.R`: Activity, smoking, alcohol, sleep, social, and lifestyle columns.
- `your_health.R`: Health, diagnoses, medications, reproductive, screening, PHQ9/GAD7, pain, and related columns.
- `family_health.R`: Family history columns (parents/siblings), family diagnosis detail columns, and related family demographics.
- `merge_sections.R`: Orchestrator and merge logic.
  - Sources `section_utils.R` then all six section scripts.
  - Reads `questionnaire/outputs/*.csv` and merges by `pid`.
  - Aligns final column order to the `bootstrap_source_data.R` reference schema (adds any missing columns; reorders to match).
  - Writes `../../data/questionnaire_data.csv` as UTF-8 BOM CSV.
- `section_utils.R`: Shared helper functions used by all section scripts and merge (see below).
- `bootstrap_source_data.R`: Minimal first-run bootstrap. Produces a `questionnaire_data` data frame with only `pid`, `questionnaire_version`, and `submission_date`. Used by `get_questionnaire_data()` when no existing output CSV is present, and by `merge_sections.R` as the reference schema for column alignment.

## What section_utils.R provides

- `resolve_generator_root()`
  - Finds the `generator_scripts` directory from the current working directory.
  - Supports running from either `generator_scripts/` or `generator_scripts/questionnaire/`.
- `ensure_columns(df, cols)`
  - Ensures every column in `cols` exists on `df`.
  - Missing columns are added as `NA_character_`.
- `fill_defaults_for_columns(df, cols)`
  - Fills columns that remain all-NA with realistic sample distributions.
  - Covers a fixed set of columns (height/weight units, housing, energy, smoking, alcohol, health status, COVID, etc.) using survey-informed probabilities.
- `get_questionnaire_data()`
  - Primary: reads existing `../../data/questionnaire_data.csv`.
  - Fallback: sources `questionnaire/bootstrap_source_data.R` if no CSV exists.
  - Normalizes all column names to lowercase.
- `get_output_dir()`
  - Returns path to `questionnaire/outputs/` and creates the directory if needed.
- `write_section(df, section_name)`
  - Normalizes mojibake encoding artifacts in character columns (curly quotes, pound sign, etc.).
  - Writes `questionnaire/outputs/<section_name>.csv` as UTF-8 BOM CSV.
  - Prints generated row/column counts.
- `write_csv_utf8bom(df, path)`
  - Lower-level writer. Prepends a UTF-8 BOM so Excel detects encoding correctly, then appends via `write.table`.
- `get_pdf_value_catalog()` / `get_pdf_column_values(col)` / `apply_pdf_value_catalog()`
  - Load, parse, and apply the PDF-derived answer catalog (see "PDF value catalog" above).
- `sanitize_pdf_values(values)` / `split_pdf_values(s)`
  - Clean raw catalog strings: strip mojibake, collapse whitespace, remove parser artifacts (TOGGLE/SELECT directives, question fragments, trailing OR).
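
## Illustrative sketch of the pattern

The per-section pattern and the merge by `pid` described above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the package's own code: `ensure_columns_sketch()`, `fill_from_catalog_sketch()`, the `example_*` column names, and the in-memory `catalog` list are all hypothetical stand-ins, and the CSV round trip through `questionnaire/outputs/` is skipped.

```r
# Hypothetical stand-in for ensure_columns(): add missing columns as NA_character_.
ensure_columns_sketch <- function(df, cols) {
  for (col in setdiff(cols, names(df))) {
    df[[col]] <- NA_character_
  }
  df
}

# Hypothetical stand-in for the catalog step: sample uniformly from the
# allowed values of each participant's version; NA where a version lacks the question.
fill_from_catalog_sketch <- function(df, versions, catalog) {
  for (col in intersect(names(catalog), names(df))) {
    for (v in unique(versions)) {
      idx <- which(versions == v)
      allowed <- catalog[[col]][[v]]
      if (is.null(allowed)) {
        df[[col]][idx] <- NA_character_
      } else {
        df[[col]][idx] <- sample(allowed, length(idx), replace = TRUE)
      }
    }
  }
  df
}

# Toy base participant frame with the three bootstrap columns.
base <- data.frame(
  pid = c("P001", "P002", "P003"),
  questionnaire_version = c("v1", "v2", "v1"),
  submission_date = as.Date(c("2024-01-05", "2024-02-10", "2024-03-15")),
  stringsAsFactors = FALSE
)

# Made-up catalog: the second column is assumed to exist only in v1.
catalog <- list(
  example_housing_type = list(v1 = c("House", "Flat"), v2 = c("House", "Flat", "Other")),
  example_smoking_status = list(v1 = c("Never", "Former", "Current"), v2 = NULL)
)

set.seed(42)  # reproducible sampling
make_section <- function(cols) {
  section_df <- ensure_columns_sketch(base, cols)[, cols]
  fill_from_catalog_sketch(section_df, base$questionnaire_version, catalog)
}
section_a <- make_section(c("pid", "example_housing_type"))
section_b <- make_section(c("pid", "example_smoking_status"))

# Merge the base frame and all section frames by pid.
merged <- Reduce(function(x, y) merge(x, y, by = "pid", all = TRUE),
                 list(base, section_a, section_b))
print(merged)
```

Keeping `pid` in every section frame is what lets the final merge stay a simple keyed join, regardless of how many columns each section contributes.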
Versions across snapshots
| Version | Repository | File | Size |
|---|---|---|---|
| 0.1.1 | rolling linux/jammy R-4.5 | ofhsyn_0.1.1.tar.gz | 390.1 KiB |
| 0.1.1 | rolling linux/noble R-4.5 | ofhsyn_0.1.1.tar.gz | 389.9 KiB |
| 0.1.1 | rolling source/ R- | ofhsyn_0.1.1.tar.gz | 252.6 KiB |
| 0.1.1 | latest linux/jammy R-4.5 | ofhsyn_0.1.1.tar.gz | 390.1 KiB |
| 0.1.1 | latest linux/noble R-4.5 | ofhsyn_0.1.1.tar.gz | 389.9 KiB |
| 0.1.1 | latest source/ R- | ofhsyn_0.1.1.tar.gz | 252.6 KiB |
| 0.1.1 | 2026-04-23 source/ R- | ofhsyn_0.1.1.tar.gz | 0 B |