
dqcheckr

Automated Data Quality Checks for Recurring Dataset Deliveries

Automates quality verification of recurring external dataset deliveries. For each new file arrival, it runs single-snapshot quality checks, compares the file to the previous delivery, writes a self-contained 'HTML' report, and records summary statistics in a local 'SQLite' database for long-term trend tracking. Supports 'CSV' and fixed-width formats. Custom organisation-specific checks can be supplied as plain R files.

README

# dqcheckr demonstrations

Two runnable demonstrations using the Star Wars dataset from `dplyr`.

## Setup

Install dqcheckr, then copy this folder to a local working directory:

```r
# Install from GitHub
devtools::install_github("mickmioduszewski/dqcheckr")

# Copy demonstrations to ~/dqcheckr_demo
dest <- file.path(path.expand("~"), "dqcheckr_demo")
src  <- system.file("demonstrations", package = "dqcheckr")
# file.copy() places the folder at ~/demonstrations; rename it afterwards
file.copy(src, dirname(dest), recursive = TRUE)
file.rename(file.path(dirname(dest), "demonstrations"), dest)
```

Open the copied folder in RStudio, open one of the demo scripts, and set the
working directory to the script's location:
**Session → Set Working Directory → To Source File Location**

---

## demo.R — Named-file mode

Checks the Star Wars dataset in both CSV and fixed-width format. Uses explicit
`current_file:` paths in the config — no folder scan.

```
data/
  starwars.csv        current file (CSV)
  starwars.fwf        current file (fixed-width)
config/
  dqcheckr.yml        global rules
  starwars_csv.yml    CSV dataset config
  starwars_fwf.yml    FWF dataset config
output/
  reports/            HTML reports written here
  snapshots.sqlite    run history written here
```

Run: open `demo.R` and source it. Two HTML reports open in the browser.

Expected result: **FAIL** on both. The `vehicles` and `starships` columns
are genuinely sparse (87% and 77% missing), so they exceed the missing-rate
threshold. All other checks pass.
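
The sparsity figures quoted above can be reproduced directly from the `dplyr` tibble. The following is a minimal sketch, not dqcheckr's own implementation: `missing_rate()` is a hypothetical helper, and it assumes that an empty list element counts as missing (in the tibble, `vehicles` and `starships` are list columns; the demo's CSV and fixed-width files store them differently).

```r
library(dplyr)

# Hypothetical helper, not part of dqcheckr.
# Assumption: for list columns, an empty element counts as "missing";
# for atomic columns, NA counts as "missing".
missing_rate <- function(x) {
  if (is.list(x)) mean(lengths(x) == 0) else mean(is.na(x))
}

# Missing rate per column of the Star Wars dataset
rates <- vapply(starwars, missing_rate, numeric(1))

# Percentages for the two sparse columns
round(100 * rates[c("vehicles", "starships")])
```

Comparing these percentages with the missing-rate threshold in `config/dqcheckr.yml` shows why both runs fail.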

---

## demo2.R — Folder-scan mode with version comparison and custom checks

Compares two deliveries of the Star Wars dataset using automatic folder-scan
file detection. Applies custom checks that target human characters.

```
data2/
  starwars_v1.csv     original data  (previous delivery)
  starwars_v2.csv     perturbed data (current delivery)
config2/
  dqcheckr.yml        global rules
  starwars_folder.yml folder-scan dataset config
custom2/
  starwars_custom.R   custom checks for human characters
output2/
  reports/            HTML reports written here
  snapshots.sqlite    run history written here
```
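
To illustrate what folder-scan detection has to do, here is a minimal base R sketch that picks the current and previous delivery from a folder. It is not dqcheckr's detection logic, which this README does not describe: the rule used here (sort by file name, newest last) and the temporary folder are assumptions made for illustration.

```r
# Build a throwaway folder that mimics data2/ (empty placeholder files)
demo_dir <- file.path(tempdir(), "data2_sketch")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("starwars_v1.csv", "starwars_v2.csv")))

# Assumption: deliveries sort chronologically by file name
files <- sort(list.files(demo_dir, pattern = "^starwars_v[0-9]+\\.csv$"))

current  <- files[length(files)]      # newest delivery
previous <- files[length(files) - 1]  # the one before it
```

Sorting by name is fragile once version numbers reach two digits (`v10` sorts before `v2`), so a real setup might sort on a date in the file name or on modification time instead.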

**Perturbations in v2** (all within the configured QC thresholds, so they add no new QC failures and surface only in the version comparison):

| Column | Change | Effect |
|--------|--------|--------|
| `height` | +15 to all values | ~8.6% mean shift — triggers CP-04 |
| `mass` | ×1.10 | ~10% mean shift — triggers CP-04 |
| `birth_year` | −5 to all values | ~5.7% mean shift — triggers CP-04 |
| `homeworld` | 5 characters → "Exegol" | new distinct value — triggers CP-05 |
| `species` | 1 character → "Chiss" | new distinct value — triggers CP-05 |

**Custom checks** (`custom2/starwars_custom.R`):

| Check | Condition | Status |
|-------|-----------|--------|
| CC-01 | Human with height > 210 | FAIL |
| CC-02 | Human with height 200–210 | WARN |
| CC-03 | Human with mass > 140 | WARN |
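
The three conditions can be written as plain R along the following lines. This is a sketch under stated assumptions, not the contents of `custom2/starwars_custom.R`: the function name `check_humans()` and its return shape are hypothetical, since the interface dqcheckr expects from a custom check file is not shown in this README. The 200–210 band is treated as inclusive.

```r
library(dplyr)

# Hypothetical helper: not dqcheckr's custom-check interface.
check_humans <- function(df) {
  humans <- df[!is.na(df$species) & df$species == "Human", ]

  # Number of rows matching each condition (NAs never match)
  n_hits <- c(
    "CC-01" = sum(humans$height > 210, na.rm = TRUE),
    "CC-02" = sum(humans$height >= 200 & humans$height <= 210, na.rm = TRUE),
    "CC-03" = sum(humans$mass > 140, na.rm = TRUE)
  )
  # Status reported when a condition matches at least one row
  severity <- c("CC-01" = "FAIL", "CC-02" = "WARN", "CC-03" = "WARN")

  data.frame(
    check  = names(n_hits),
    n_rows = unname(n_hits),
    status = unname(ifelse(n_hits > 0, severity[names(n_hits)], "PASS")),
    row.names = NULL
  )
}

# Apply to data perturbed like v2 (height +15, mass x1.10)
v2 <- mutate(starwars, height = height + 15, mass = mass * 1.10)
result <- check_humans(v2)
result
```

Keeping each condition as a row count makes it easy to report both the status and how many records triggered it.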

Run: open `demo2.R` and source it. One HTML report opens in the browser
showing QC checks, custom checks, and version comparison.

Expected result: **FAIL**, caused by the `vehicles`/`starships` missing rate
(standard check) and CC-01, human height (custom check). The version comparison
shows distribution shifts in `height`, `mass`, and `birth_year`, plus new
distinct values in `homeworld` and `species`.

Versions across snapshots

| Version | Repository | File | Size |
|---------|------------|------|------|
| `0.1.2` | rolling linux/jammy R-4.5 | `dqcheckr_0.1.2.tar.gz` | 185.5 KiB |
| `0.1.2` | rolling linux/noble R-4.5 | `dqcheckr_0.1.2.tar.gz` | 185.2 KiB |
| `0.1.2` | rolling source/ R- | `dqcheckr_0.1.2.tar.gz` | 99.4 KiB |
| `0.1.2` | latest linux/jammy R-4.5 | `dqcheckr_0.1.2.tar.gz` | 185.5 KiB |
| `0.1.2` | latest linux/noble R-4.5 | `dqcheckr_0.1.2.tar.gz` | 185.2 KiB |
| `0.1.2` | latest source/ R- | `dqcheckr_0.1.2.tar.gz` | 99.4 KiB |
| `0.1.2` | 2026-04-23 source/ R- | `dqcheckr_0.1.2.tar.gz` | 0 B |

Dependencies (latest)

Imports

Suggests

testthat (>= 3.0.0)