Ambiguous database-to-derivation mapping in variables.csv #155
Replies: 1 comment 2 replies
-
|
The details about which variables go with which database in the case where there are multiple derived variables (DerivedVar entires) are in the variable details sheet. Here are the relevant entries for active_transport and energy_exp. For example for active_transport, That being said, I actually wanted to change that so that it all of this information is in the variables sheet using one of your options. Option C would be great but Option B is a good in between. One issue with Option B though is that it makes it harder programmatically to parse out this information as in the variable start strings |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Summary
The current
variables.csvschema allowsvariableStartto contain both[VAR]defaults andDerivedVar::entries without explicit association to specific databases. This creates ambiguity about which mapping applies to which database/cycle.Example 1:
active_transportLocation:
inst/extdata/variables.csvline 2Current entry:
active_transportcchs2001_p, cchs2003_p, cchs2005_p, cchs2007_2008_p, cchs2009_2010_p, cchs2010_p, cchs2011_2012_p, cchs2012_p, cchs2013_2014_p, cchs2014_p, cchs2015_2016_p, cchs2017_2018_p, cchs2009_m, cchs2010_m, cchs2012_mDerivedVar::[PAC_4A_cont, PAC_4B_cont], DerivedVar::[PAC_7, PAC_7A, PAC_7B_cont, PAC_8, PAC_8A, PAC_8B_cont], DerivedVar::[PAYDVTTR, PAADVTRV]The problem: There are 15 databases listed, but three different
DerivedVar::formulas with no indication of which formula applies to which databases.Example 2:
energy_expLocation:
inst/extdata/variables.csvline 114Current entry:
energy_expcchs2001_p, cchs2003_p, cchs2005_p, cchs2007_2008_p, cchs2009_2010_p, cchs2010_p, cchs2011_2012_p, cchs2012_p, cchs2013_2014_p, cchs2014_p, cchs2015_2016_p, cchs2017_2018_p, cchs2009_s, cchs2010_s, cchs2012_scchs2001_p::PACADEE, cchs2003_p::PACCDEE, cchs2005_p::PACEDEE, [PACDEE], DerivedVar::[DHHGAGE_cont, PAA_045, PAA_050, PAA_075, PAA_080, PAADVDYS, PAADVVIG, PAYDVTOA, PAYDVADL, PAYDVDYS]The problem: There are 15 databases listed, with three different mapping types:
cchs2001_p::PACADEE,cchs2003_p::PACCDEE,cchs2005_p::PACEDEE[PACDEE]- applies to which remaining cycles?DerivedVar::[DHHGAGE_cont, PAA_045, ...]- applies to which cycles?After subtracting the 3 explicit mappings, there are 12 remaining databases. But there's no way to know which of the 12 use
[PACDEE](pass-through) vs theDerivedVar::formula.Why this matters
Contrast with pass-through variables
For pass-through mappings like
ADL_01, the pattern works well:Here, explicit
db::VARmappings cover specific cycles, and[ADL_01]is the default for remaining databases. The semantics are clear because there's only one default pattern.Possible solutions
Option A: Explicit database prefixes for DerivedVar
Pros: Fully explicit, validates easily Cons: Verbose, repetitive
Option B: New syntax with database grouping
Pros: Compact, explicit Cons: Requires parser changes
Option C: Remove
variableStartfromvariables.csvThe database-to-source-variable mapping already exists in
variable_details.csvwith full explicitdb::VARsyntax. ThevariableStartcolumn invariables.csvis:variable_details.csv)db::VAR,[DEFAULT],DerivedVar::[])Pros: Eliminates ambiguity entirely, single source of truth Cons: Loss of quick overview in
variables.csv, migration effortRecommendation
I lean toward Option B (database grouping syntax) for these reasons:
Quick reference value:
variables.csvserves as a human-readable index. Seeing[cchs2001_p, cchs2003_p]::DerivedVar::[...]at a glance tells you the variable's derivation pattern without openingvariable_details.csvStructural clarity: The grouping syntax makes the database-to-mapping relationship explicit while keeping
variables.csvas the canonical list of harmonized variablesIncremental migration: Option B can be adopted incrementally for ambiguous cases, whereas Option C requires a larger schema change
That said, Option C has a compelling simplicity argument—one source of truth is easier to maintain than two that must stay synchronized. The deciding factor may be: how often do people consult
variableStartdirectly vs using tooling that reads fromvariable_details.csv?Questions for discussion
Related
This issue was discovered while validating smoking variable harmonization worksheets. The smoking variables use explicit
db::VARmappings throughout, which made validation straightforward.Beta Was this translation helpful? Give feedback.
All reactions