The examples below demonstrate common schema command permutations using the bundled Big 5 player statistics dataset located under tests/data.
Inspect a handful of rows and review inferred samples, format hints, and override status directly in the console:
csv-managed schema probe -i tests/data/big_5_players_stats_2023_2024.csv --sample-rows 5Output highlights:
- Table columns: raw name, inferred type, rename (if any), override flag, sample value, and format hint.
- Footer summarises how many rows were scanned and whether any decode errors were encountered.
Generate a schema file populated with snake_case renames and empty replacement arrays so you can fill in value substitutions later:
csv-managed schema infer --mapping --replace-template `
-i tests/data/big_5_players_stats_2023_2024.csv `
-o tmp/big5_inferred-schema.yml --sample-rows 0This command performs a full scan (--sample-rows 0) before writing tmp/big5_inferred-schema.yml.
Force specific column types while still injecting automatic renames for downstream processing:
csv-managed schema infer --mapping `
--override Performance_Gls:integer `
--override "Per 90 Minutes_Gls:string" `
-i tests/data/big_5_players_stats_2023_2024.csv `
-o tmp/big5_overrides-schema.yml --sample-rows 10The resulting schema keeps inferred types for all other columns, but Performance_Gls becomes integer and Per 90 Minutes_Gls becomes string with the rename per_90_minutes_gls.
Review the inference table and the exact YAML that would be written—complete with optional template scaffolding—without touching the filesystem:
csv-managed schema infer --mapping --replace-template \
-i tests/data/big_5_players_stats_2023_2024.csv \
--sample-rows 25 --preview--preview prints the same probe table seen with schema probe, followed by the YAML that would be emitted (honouring --replace-template, NA placeholder handling, and overrides). Supplying -o tmp/big5_schema.yml alongside --preview performs a dry run; the file is skipped until you rerun without --preview.
Review how a fresh inference differs from a previously saved schema before overwriting it:
csv-managed schema infer -i tests/data/big_5_players_stats_2023_2024.csv --sample-rows 0 --diff tmp/big5_inferred-schema.ymlWhen differences exist, the command emits a unified diff (--- existing / +++ inferred) showing the exact YAML changes. No files are modified, making it safe to audit changes prior to accepting them. Combine --diff with --preview to inspect both the diff and the would-be output in a single run.
Treat placeholder tokens (NA, N/A, #NA, #N/A) as empty for datatype voting and inject replacement entries into the written schema that fill them with a canonical token (e.g. NULL).
csv-managed schema infer --mapping \
--na-behavior fill --na-fill NULL \
-i tests/data/big_5_players_stats_2023_2024.csv \
-o tmp/big5_na-schema.yml --sample-rows 250Excerpt from generated tmp/big5_na-schema.yml (simplified):
columns:
- name: Player
datatype: String
rename: player
replace:
- value: "NA"
replacement: "NULL"
- value: "N/A"
replacement: "NULL"
- value: "#NA"
replacement: "NULL"
- value: "#N/A"
replacement: "NULL"
- name: Performance_Gls
datatype: Integer
rename: performance_gls
replace:
- value: "NA"
replacement: "NULL"
- value: "N/A"
replacement: "NULL"
- value: "#NA"
replacement: "NULL"
- value: "#N/A"
replacement: "NULL"Using --na-behavior empty instead would map each placeholder to an empty string (""), allowing downstream transforms or replacements to interpret the value as genuinely missing.
Run probe without writing a file to see recommended replacements:
csv-managed schema probe -i tests/data/big_5_players_stats_2023_2024.csv --na-behavior empty --sample-rows 100The console output appends a section:
Placeholder Suggestions (replace with empty string):
• Performance_Gls (Integer (non-string))
tokens: NA (3), #N/A (1)
replacements:
- from "NA" -> to ""
- from "#N/A" -> to ""
Persist these by re-running with schema infer.
The inference engine selects a column's datatype via majority voting across sampled (or full scan) non-empty values. Below are illustrative scenarios you can reproduce by crafting small CSV snippets.
scores.csv:
score
10
22
5
7
14
3
9
11
8
6
12
4
All 12 values parse cleanly as Integer (and implicitly as Float). Integer holds a 100% majority so the inferred type is Integer.
csv-managed schema probe -i scores.csv --sample-rows 0mixed_numeric.csv:
amount
10
22.5
5
7.75
14
3.10
9.0
11
8
Votes: Integer (6), Float (9). Integer does not exceed 50% of parsed values, so Float (plurality) is selected.
precise.csv:
measurement
1.2345
2.1000
3.0000
4.9999
5.1234
Each value fits scale 4; the engine infers decimal(18,4) (precision shown may vary based on max digits encountered). If a minority value had 5 decimals, Decimal would lose majority and fall back to Float (plurality) unless overridden.
temporal_mixed.csv:
stamp
2024-01-01
2024-01-02 04:30:00
2024-01-03
2024-01-04 17:15:00
misc
Votes: Date (2), DateTime (2), String (1). No majority; specificity ordering prefers the simpler canonical subset when exact tie—result: Date (because Date tokens can be parsed uniformly while DateTime adds optional time components without majority advantage). If an additional DateTime line were added to break the tie, DateTime would win.
prices.csv:
amount
$12.00
14
15
All three values satisfy currency parsing (0 or 2 decimal places) and one-third of the sample bears a currency symbol. Inference now promotes the column to Currency before integer/float evaluation, preventing price columns like this from being labelled Integer.
If a mostly-numeric ID column contains a few non-numeric placeholders (e.g., NA, #N/A) those rows can spoil majority. Provide replacements or force type:
csv-managed schema infer -i ids.csv -o ids-schema.yml --override id:Integer --sample-rows 0Planned enhancement (backlog) will allow treating NA, N/A, #NA, #N/A as empty votes so they no longer dilute numeric or temporal majorities. Until then, consider adding explicit replacements (replace: arrays) for such tokens.
- Increase
--sample-rowsbeyond the default (2000) when early rows are skewed. - Keep at least roughly one-third of sampled rows with currency symbols if you want automatic Currency detection; otherwise supply overrides.
- Use
--snapshotafter confirming inference to lock the layout & types. - Inspect the probe table's sample and format hints for confirmation before persisting a schema.
Capture the current probe table rendering and fail future runs if the layout changes unexpectedly:
csv-managed schema probe -i tests/data/big_5_players_stats_2023_2024.csv --sample-rows 5 --snapshot tmp/big5_probe.snapOn first execution the file tmp/big5_probe.snap is created. Subsequent executions compare the live output to the stored snapshot and return a non-zero exit code if the formatting differs.
Schema files can declare transformation steps that run before value replacements or final type parsing. Below is the same example expressed in YAML (preferred) converting ISO‑8601 timestamps into bare dates and rounding verbose decimals to four places:
columns:
- name: order_ts
datatype: Date
datatype_mappings:
- from: String
to: DateTime
- from: DateTime
to: Date
- name: amount
datatype: Float
datatype_mappings:
- from: String
to: Float
strategy: round
options:
scale: 4The process, append, stats, and schema verify commands automatically apply these mappings unless you opt out with --skip-mappings. Use --apply-mappings to enforce them when chaining custom workflows.
Key rules and capabilities when authoring datatype_mappings:
- Capitalization: Data types in production schema files should be capitalized (
String,Integer,Float,Boolean,Date,DateTime,Time,Guid,Currency). All YAML examples below follow this convention. - Ordering: Mappings are applied in declaration order, top to bottom. Each mapping consumes the previous output.
- Options: A mapping may include an
optionsobject to guide parsing or rendering (e.g., a custom datetimeformat). - Strategies: Supported
strategyvalues (case-insensitive) by context:round(numeric, including Currency),truncate(numeric to Integer or Currency scale adjustment),trim/lowercase/uppercase(String→String). Rounding uses ascaleinoptions(defaults to4for Float; Currency requires an explicit allowed scale of2or4). - Error handling: If any mapping step fails (e.g., a datetime parse), the entire row fails verification for that column.
If your timestamp includes a trailing Z (UTC designator) or other formatting not covered by the built‑in fallbacks, supply an explicit chrono format string:
name: ordered_raw
datatype: Date
rename: ordered_at
datatype_mappings:
- from: String
to: DateTime
options:
format: "%Y-%m-%dT%H:%M:%SZ"
- from: DateTime
to: DateWhen no options.format is provided for a String → DateTime mapping, the parser attempts these formats in order:
%Y-%m-%d %H:%M:%S
%Y-%m-%dT%H:%M:%S
%d/%m/%Y %H:%M:%S
%m/%d/%Y %H:%M:%S
%Y-%m-%d %H:%M
%Y-%m-%dT%H:%M
To parse timestamps that include timezone markers (Z, offsets) or fractional seconds, you must provide a matching options.format string.
| Token | Meaning |
|---|---|
%Y |
4‑digit year |
%m |
Month (01–12) |
%d |
Day of month (01–31) |
%H |
Hour (00–23) |
%M |
Minute (00–59) |
%S |
Second (00–60) |
%f |
Fractional seconds (nanoseconds) |
%z |
Offset like +0000 (requires explicit format & non‑naive parsing) |
%Z |
Timezone name/abbrev (not applied to naive conversions) |
For ISO‑8601 with a trailing Z (e.g. 2024-04-01T08:30:00Z), include the literal Z at the end of the pattern (%Y-%m-%dT%H:%M:%SZ).
- Raw value ingested.
datatype_mappingschain executes.- Value replacements (if any) apply.
- Final typed parsing validates against the column's declared
datatype.
Design mappings so the last step produces a representation parsable as the target datatype.
The Currency datatype enforces a fixed scale of either 2 or 4 decimal places. Parsing accepts common symbols ($, €, £, ¥), thousands separators (,), and negative formats (leading - or parentheses). Any value not matching an allowed scale is rejected during verification.
columns:
- name: gross_amount_raw
datatype: Currency
datatype_mappings:
- from: String
to: Currency
strategy: round
options:
scale: 2columns:
- name: fx_rate_raw
datatype: Currency
datatype_mappings:
- from: String
to: Currency
strategy: truncate
options:
scale: 4If upstream processing produced a Float and you need a canonical currency with 2 decimals, you can map directly.
columns:
- name: net_amount
datatype: Currency
datatype_mappings:
- from: Float
to: Currency
strategy: round
options:
scale: 2round: Midpoint rounding is applied away from zero to enforce the target scale.truncate: Removes excess fractional digits without rounding (e.g.,123.456789with scale 4 becomes123.4567).- Scale must be explicitly supplied for Currency mappings (
2or4). Omitting scale yields a validation error.
Currency columns participate in numeric aggregations (min, max, mean, std dev) and frequency displays. The original scale is preserved for consistent formatting during output.
Common failure conditions for Currency parsing:
- Invalid symbol placement (e.g.,
123$instead of$123). - Unsupported scale (
123.456→ 3 decimals is rejected). - Mixed thousands separators and spaces (e.g.,
$1, 234.00). - Parentheses with a leading minus (
-(123.45)), double negative. - Trailing or leading stray characters (
$123.45USD).
Design your mappings to sanitize upstream data before conversion when necessary.