UKBAnalytica
UK Biobank Data Processing and Survival Analysis Toolkit
Provides an integrated workflow for analysis tables hosted on or generated by the UK Biobank Research Analysis Platform (RAP). The package supports RAP phenotype extraction planning, predefined variable sets and disease definitions, standardized baseline preprocessing, multi-source endpoint ascertainment, prevalent and incident case classification, survival-ready cohort construction, regression, multiple imputation, propensity score analysis, mediation analysis, subgroup and sensitivity analyses, machine learning, proteomics enrichment and protein-protein interaction analysis, and publication-oriented visualization. The package workflow is described in He et al. (2026) <doi:10.64898/2026.06.19.26356057>.
README
# UKBAnalytica Skill Pack (`UKBAnalytica_skills`)
Agent-runtime-agnostic skill bundle for the [`UKBAnalytica`](https://github.com/Hinna0818/UKBAnalytica)
R package. The pack works with any agent stack that supports a "skill" / "tool
description" loaded from YAML front-matter + Markdown (Claude Code skills,
OpenAI Assistants instructions / function-tool descriptions, custom RAG-driven
agents, etc.).
The pack lives at `inst/skills/UKBAnalytica_skills/`, relative to the repository root.
With the repository root as the working directory, load it via
`file.path(getwd(), "inst/skills/UKBAnalytica_skills")`, or see
[INSTALL.md](INSTALL.md) for per-runtime loader snippets.
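The `getwd()` form only resolves when the working directory is the repository root. As a minimal sketch, the lookup below tries a source checkout first and then an installed copy of the package, relying on the fact that R copies the contents of `inst/` to the package root at install time. The helper name `find_skill_pack` is hypothetical and not part of `UKBAnalytica`.

```r
# Hypothetical helper (not exported by UKBAnalytica): locate the skill pack.
find_skill_pack <- function(repo_root = getwd()) {
  # 1. Source checkout: the pack sits under inst/ in the repository root.
  src <- file.path(repo_root, "inst", "skills", "UKBAnalytica_skills")
  if (dir.exists(src)) {
    return(normalizePath(src))
  }

  # 2. Installed package: inst/ contents are copied to the package root,
  #    so the "inst" prefix is dropped. system.file() returns "" on a miss.
  installed <- system.file("skills", "UKBAnalytica_skills",
                           package = "UKBAnalytica")
  if (nzchar(installed)) {
    return(installed)
  }

  stop("UKBAnalytica_skills not found; see INSTALL.md for loader options.")
}
```

Calling `find_skill_pack()` from the repository root returns the absolute pack path, which can then be handed to whichever runtime loader is in use.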
---
## Charter (read first)
All skills in this pack enforce the following non-negotiable rules:
1. **Script-generation boundary.** These skills help a local agent generate
scripts, analysis plans, package-usage guidance, and manuscript text from
aggregate outputs. They are not permission for an agent to read or process
real UK Biobank RAP participant-level data.
2. **RAP-only execution.** Scripts that touch real UK Biobank data must be run
by the user inside the approved UK Biobank Research Analysis Platform (RAP),
typically `RAP JupyterLab → Terminal → R`.
3. **No real rows in agent context.** Do not give the agent participant-level
files, R objects, screenshots, logs, tracebacks, row samples, `head(data)`,
`eid`, exact dates, raw RAP fields, per-row predictions, or row-level SHAP
matrices. This applies even when identifiers have been removed.
4. **Aggregate outputs only.** The user may share aggregate summaries with the
agent, including participant-flow counts, baseline tables, regression
summaries, model metrics, enrichment results, and rendered figures.
5. **Schema-only prompts.** The user may describe column names, variable roles,
and intended analyses. The agent uses synthetic toy data for smoke tests and
writes final scripts for RAP execution.
6. **Large extracts go async.** When the requested field count is large,
skills route the user through `rap_submit_extract()` (DNAnexus
`table-exporter`) rather than pulling everything into the R session.
7. **`UKBAnalytica` is the sole executor.** Skills wrap real exported
functions only — every function name in a skill must be present in
`NAMESPACE`.
8. **No journal-brand styling.** Plotting guidance uses neutral palette names
(`ukbsci_clinical`, `ukbsci_diverging`, `ukbsci_sequential`) and avoids
referencing specific top-tier journal brands.
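
Rule 5 lets the agent smoke-test scripts on synthetic toy data built from a schema description alone. The sketch below shows one way to do that in base R. The helper `make_toy_data`, the column names, and the type labels (`numeric`, `binary`, `time`, `factor`) are all invented for illustration; no real UK Biobank values or field names are involved.

```r
# Hypothetical helper: build synthetic rows from a schema description only.
# `schema` is a named character vector: column name -> illustrative type label.
make_toy_data <- function(schema, n = 50, seed = 1) {
  set.seed(seed)
  cols <- lapply(schema, function(type) {
    switch(type,
      numeric = rnorm(n),                          # continuous covariate
      binary  = rbinom(n, size = 1, prob = 0.3),   # 0/1 indicator
      time    = rexp(n, rate = 0.1),               # positive follow-up time
      factor  = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
      stop("Unknown column type: ", type)
    )
  })
  as.data.frame(cols)
}

# Column names and roles are made up; they describe a schema, not real data.
schema <- c(age = "numeric", smoker = "binary", followup_years = "time",
            event = "binary", region = "factor")
toy <- make_toy_data(schema, n = 20)
str(toy)
```

A script that runs cleanly on `toy` can then be handed to the user for execution inside RAP, with only aggregate outputs coming back to the agent.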
---
## Provider compatibility
| Runtime | Loader pattern |
|---------|----------------|
| Claude Code skills | Drop `inst/skills/UKBAnalytica_skills/` into `~/.claude/skills/` (or workspace `.claude/skills/`); each `ukbsci-*/SKILL.md` is auto-discovered via its `name` + `description` front-matter. |
| OpenAI Assistants / Responses | Concatenate `SKILL.md` files into the assistant `instructions` (or attach as file-search documents); use the `description` field as the routing summary. |
| LangChain / LlamaIndex agents | Treat each `ukbsci-*/` directory as a Tool whose docstring = `description`, whose body = `SKILL.md` + `references/*.md`. |
| Generic JSON tool list | Read each `SKILL.md` YAML header → `{name, description}`; load body lazily when the agent decides to invoke. |
The front-matter is intentionally minimal (`name`, `description`) so it parses
identically across providers. No provider-specific keys are used. Regardless
of provider, these skills must not be used to send real participant-level RAP
data to the agent.
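
Because the front-matter carries only `name` and `description`, a generic loader can read it without a YAML library. The following is a minimal base-R sketch under the assumption that both keys are single-line `key: value` pairs between two `---` lines at the top of the file. The helper `read_skill_header` and the demo file contents are hypothetical.

```r
# Hypothetical helper: read `name` and `description` from a SKILL.md header.
# Assumes single-line `key: value` pairs between two `---` lines.
read_skill_header <- function(path) {
  lines <- readLines(path, warn = FALSE)
  fences <- which(trimws(lines) == "---")
  if (length(fences) < 2 || fences[1] != 1) {
    stop("No YAML front-matter found in ", path)
  }
  # Lines strictly between the first two fences (may be empty).
  header <- lines[seq_len(fences[2] - 2) + 1]

  get_key <- function(key) {
    hit <- grep(paste0("^", key, ":"), header, value = TRUE)
    if (length(hit) == 0) return(NA_character_)
    value <- trimws(sub(paste0("^", key, ":"), "", hit[1]))
    gsub("^[\"']|[\"']$", "", value)  # strip optional surrounding quotes
  }

  list(name = get_key("name"), description = get_key("description"))
}

# Demo on a throwaway file with invented content.
demo <- tempfile(fileext = ".md")
writeLines(c("---",
             "name: ukbsci-demo",
             "description: \"Toy skill for UKBAnalytica\"",
             "---",
             "# Body"), demo)
hdr <- read_skill_header(demo)
```

Multi-line or folded YAML values are not handled here; a full YAML parser would be the safer choice if a description ever spans several lines.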
---
## Pack layout
```
inst/skills/UKBAnalytica_skills/
├── README.md ← this file
├── INSTALL.md ← per-provider install snippets
├── MANIFEST.json ← machine-readable index (name, description, path)
├── ukbsci-rap-extract/ (P2) RAP discover / plan / extract
├── ukbsci-cohort/ (P2) disease definitions + survival cohort
├── ukbsci-workflow/ (P2) end-to-end orchestrator
├── ukbsci-regression/ (P3) batch lm / logit / Cox + extensions
├── ukbsci-survival/ (P3) KM + risk table + log-rank
├── ukbsci-baseline/ (P3) tableone Table 1
├── ukbsci-propensity/ (P4) PS / PSM / IPTW / balance
├── ukbsci-mediation/ (P4) regmedint 4-way decomposition
├── ukbsci-subgroup-sensitivity/ (P4) subgroup × interaction + sensitivity
├── ukbsci-imputation/ (P4) mice + Rubin pooling
├── ukbsci-proteomics/ (P5) Olink / STRING / GO / KEGG / PPI
├── ukbsci-ml/ (P5) classification + survival ML + SHAP
├── ukbsci-preprocess/ (P5) variable cleaning + composites
└── ukbsci-plot/ (P6) forest / volcano / calibration / theme
```
Each skill directory contains:
```
SKILL.md              ← YAML front-matter (name, description) + body
README.md             ← human-readable overview
references/
  functions.md        ← every exported function signature + caveats
  rap-guardrails.md   ← what is forbidden in this module
  examples.md         ← copy-pastable minimal examples
evals/
  evals.json          ← trigger-recall test cases for the skill router
```
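
A quick way to confirm that a skill directory follows this layout is to check for the listed files. The sketch below assumes the three reference files live under `references/` and `evals.json` under `evals/`, as the listing suggests. The helper name `missing_skill_files` is hypothetical.

```r
# Files each skill directory is expected to contain, relative to its root.
# The nesting under references/ and evals/ is inferred from the listing.
expected_files <- c(
  "SKILL.md",
  "README.md",
  file.path("references", c("functions.md", "rap-guardrails.md", "examples.md")),
  file.path("evals", "evals.json")
)

# Hypothetical helper: return the expected files missing from `skill_dir`.
missing_skill_files <- function(skill_dir) {
  present <- file.exists(file.path(skill_dir, expected_files))
  expected_files[!present]
}
```

An empty result from `missing_skill_files("ukbsci-cohort")`, run inside the pack directory, would indicate that the skill is structurally complete.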
---
## Trigger-word routing table
Every `SKILL.md` description ends with one of the phrases **"UK Biobank RAP"** or **"UKBAnalytica"**
to keep the router from confusing these skills with generic R / statistics
skills. The triggers below are written into the `description` field verbatim.
| Skill | English triggers | Chinese triggers |
|-------|------------------|------------------|
| `ukbsci-rap-extract` | UK Biobank RAP extract, dx extract_dataset, table-exporter, UKB field search, phenotype extraction | RAP 提取, 字段下载, table-exporter, UKB 字段搜索, 表型提取 |
| `ukbsci-cohort` | UKB cohort, disease definition, prevalent vs incident, survival dataset, ICD10 phenotyping | UKB 队列, 病例定义, prevalent vs incident, 生存数据集, ICD10 表型 |
| `ukbsci-workflow` | end-to-end UKB analysis, UKB pipeline, RAP-to-publication, full study plan | 端到端 UKB 分析, RAP 到论文, 完整流程, 项目计划 |
| `ukbsci-regression` | UKB regression, Cox model, logistic / linear, batch regression, PH diagnostics, competing risks, p_trend | UKB 回归, Cox 模型, 批量回归 |
| `ukbsci-survival` | UKB KM curve, Kaplan-Meier, log-rank, risk table | UKB 生存曲线, KM 曲线 |
| `ukbsci-baseline` | Table 1, baseline characteristics, demographics summary | 基线表, 基线特征 |
| `ukbsci-propensity` | propensity score, PSM, IPTW, ATE / ATT, Love plot, covariate balance | 倾向评分, 倾向得分匹配 |
| `ukbsci-mediation` | mediation, indirect effect, natural direct effect, TNIE, PNDE, proportion mediated | 中介分析 |
| `ukbsci-subgroup-sensitivity` | subgroup analysis, interaction, effect modification, sensitivity, complete-case, lag | 亚组分析, 敏感性分析 |
| `ukbsci-imputation` | multiple imputation, MI, mice, Rubin's rules, FMI | 多重插补, 插补合并 |
| `ukbsci-proteomics` | UKB proteomics, Olink, UKB-PPP, STRING PPI, GO ORA, KEGG ORA, MCODE | 蛋白组分析, 通路富集 |
| `ukbsci-ml` | UKB ML, XGBoost, random forest, SHAP, C-index, calibration, decision curve, AUC | UKB 机器学习, SHAP, 生存 ML |
| `ukbsci-preprocess` | UKB preprocessing, variable cleaning, negative code, derive BP / air pollution / diet score | UKB 变量预处理 |
| `ukbsci-plot` | UKB plotting, forest plot, volcano, calibration, manuscript figure | ukbsci 画图, 论文级图 |
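
To illustrate how trigger words can drive routing, here is a toy matcher over a few triggers copied from the table. It is only a sketch: real agent runtimes route on the full `description` text using their own logic, and the helper `match_skills` is hypothetical.

```r
# A few triggers copied from the table; the matching logic is illustrative.
triggers <- list(
  "ukbsci-survival"   = c("Kaplan-Meier", "log-rank", "risk table"),
  "ukbsci-imputation" = c("multiple imputation", "mice", "Rubin's rules"),
  "ukbsci-propensity" = c("propensity score", "PSM", "IPTW")
)

# Hypothetical helper: skills with at least one trigger found in the query.
# Matching is a case-insensitive literal substring search.
match_skills <- function(query, triggers) {
  q <- tolower(query)
  hit <- vapply(triggers, function(words) {
    any(vapply(words, function(w) grepl(tolower(w), q, fixed = TRUE),
               logical(1)))
  }, logical(1))
  names(triggers)[hit]
}

match_skills("Draw a Kaplan-Meier curve with a risk table for UK Biobank RAP data",
             triggers)
```

For this query the matcher returns `"ukbsci-survival"`. Plain substring matching can produce false positives for short triggers such as `mice`, which is one reason the descriptions also carry the "UK Biobank RAP" / "UKBAnalytica" anchor phrases.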
---
## Status
All 14 skills shipped (v1.0.0). Phase grouping reflects the writing order:
| Phase | Skills | State |
|-------|--------|-------|
| P2 | `ukbsci-rap-extract`, `ukbsci-cohort`, `ukbsci-workflow` | shipped |
| P3 | `ukbsci-regression`, `ukbsci-survival`, `ukbsci-baseline` | shipped |
| P4 | `ukbsci-propensity`, `ukbsci-mediation`, `ukbsci-subgroup-sensitivity`, `ukbsci-imputation` | shipped |
| P5 | `ukbsci-proteomics`, `ukbsci-ml`, `ukbsci-preprocess` | shipped |
| P6 | `ukbsci-plot` | shipped |
See [`supp/UKBAnalytica-skill-roadmap.md`](../supp/UKBAnalytica-skill-roadmap.md)
for the full implementation roadmap.
---
## Citation
When agents emit R scripts based on this pack, the recommended citation header
at the top of every generated script is:
```r
###############################################################################
# UKBAnalytica Citation:
# He N. UKBAnalytica: Scalable Phenotyping and Statistical Pipeline for
# UK Biobank RAP Data. R package version 1.0.0.
# https://github.com/Hinna0818/UKBAnalytica
###############################################################################
```
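
An agent or post-processing step could add this header automatically. The sketch below prepends it to a script held as a character vector and skips scripts that already carry it. The helper `add_citation_header` is hypothetical, and the 79-character rule width is a cosmetic assumption.

```r
# Header text taken from the recommended citation; rule width is cosmetic.
citation_header <- c(
  strrep("#", 79),
  "# UKBAnalytica Citation:",
  "# He N. UKBAnalytica: Scalable Phenotyping and Statistical Pipeline for",
  "# UK Biobank RAP Data. R package version 1.0.0.",
  "# https://github.com/Hinna0818/UKBAnalytica",
  strrep("#", 79)
)

# Hypothetical helper: prepend the header unless the script already has it.
add_citation_header <- function(script_lines) {
  if (any(grepl("UKBAnalytica Citation:", script_lines, fixed = TRUE))) {
    return(script_lines)
  }
  c(citation_header, "", script_lines)
}

# Usage on an in-memory script; writeLines(out, "analysis.R") would save it.
out <- add_citation_header(c("library(UKBAnalytica)", "# analysis steps ..."))
```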
Versions across snapshots
| Version | Repository | File | Size |
|---|---|---|---|
| 1.0.0 | rolling linux/jammy R-4.5 | UKBAnalytica_1.0.0.tar.gz | 2.4 MiB |
| 1.0.0 | rolling linux/noble R-4.5 | UKBAnalytica_1.0.0.tar.gz | 2.4 MiB |
| 1.0.0 | rolling source/ R- | UKBAnalytica_1.0.0.tar.gz | 999.3 KiB |
| 1.0.0 | latest linux/jammy R-4.5 | UKBAnalytica_1.0.0.tar.gz | 2.4 MiB |
| 1.0.0 | latest linux/noble R-4.5 | UKBAnalytica_1.0.0.tar.gz | 2.4 MiB |
| 1.0.0 | latest source/ R- | UKBAnalytica_1.0.0.tar.gz | 999.3 KiB |
| 1.0.0 | 2026-04-23 source/ R- | UKBAnalytica_1.0.0.tar.gz | 0 B |