Extend DiD Classes #292

Merged
merged 285 commits into from
Apr 25, 2025
Changes from 47 commits
Commits
89eab05
add tune to did_binary
SvenKlaassen Feb 13, 2025
8ec69e6
add sensitivity to did binary
SvenKlaassen Feb 13, 2025
86e34a3
add external predictions to did binary
SvenKlaassen Feb 13, 2025
7f2892e
add exception for g_value=0
SvenKlaassen Feb 13, 2025
1616eae
formatting
SvenKlaassen Feb 13, 2025
15b4c27
update utils did (not completed)
SvenKlaassen Feb 13, 2025
47dc23d
fix input checks
SvenKlaassen Feb 14, 2025
7f9018e
extend checks
SvenKlaassen Feb 14, 2025
704027d
rename did utils
SvenKlaassen Feb 14, 2025
786937e
sort panel data t and g values
SvenKlaassen Feb 14, 2025
d5110d2
basic did binary version
SvenKlaassen Feb 14, 2025
add00c4
test did_binary vs did
SvenKlaassen Feb 13, 2025
035f00d
remove unnecessary tune list
SvenKlaassen Feb 18, 2025
2162adf
remove default values for g_value, t_value_pre and t_value_eval
SvenKlaassen Feb 18, 2025
9a44ebc
add never treated and force same dtype for g and t values
SvenKlaassen Feb 18, 2025
d6d3605
add tests for get_never_treated_value
SvenKlaassen Feb 18, 2025
a45f7b4
define _is_never_treated
SvenKlaassen Feb 18, 2025
7e7bbf2
update dgp and tests to int treatment
SvenKlaassen Feb 19, 2025
e32bc00
add make_did_CS2021
SvenKlaassen Feb 19, 2025
f99605a
Add different time types and make never treated optional
SvenKlaassen Feb 19, 2025
d33c4fa
fix format
SvenKlaassen Feb 19, 2025
9a91511
remove init_pred from did_binary
SvenKlaassen Feb 21, 2025
39a13de
remove _set_score_elements from did_binary
SvenKlaassen Feb 21, 2025
2294faf
add option to allow for nan d values
SvenKlaassen Feb 21, 2025
f7b9621
add balanced panel check
SvenKlaassen Feb 21, 2025
ee0f51d
add new return type test for did
SvenKlaassen Feb 21, 2025
4ca3c7c
fix format
SvenKlaassen Feb 24, 2025
1211b17
add .coverage to gitignore
SvenKlaassen Feb 24, 2025
a2af62a
update return type test
SvenKlaassen Feb 24, 2025
2a3fbe9
update CS2021 dgp
SvenKlaassen Feb 24, 2025
283a438
rename ext pred test did_binary
SvenKlaassen Feb 24, 2025
92b4f2d
update coefficient matrix calculations in DGP for CS2021
SvenKlaassen Feb 24, 2025
86de9d0
add shape checks for predictions, targets and loss
SvenKlaassen Feb 24, 2025
0ab44d9
add test for predictions
SvenKlaassen Feb 24, 2025
51b25e7
add senstivity return type checks for did
SvenKlaassen Feb 24, 2025
97c4f40
extend arrays to all ids
SvenKlaassen Feb 25, 2025
4e2a769
add placebo test
SvenKlaassen Feb 25, 2025
ff82039
update p estimate to full sample
SvenKlaassen Feb 25, 2025
cc9b68d
add different tests for panel vs two period settings
SvenKlaassen Feb 25, 2025
5e3bb07
formatting and removing old test
SvenKlaassen Feb 25, 2025
9c357f6
remove not relevant exception
SvenKlaassen Feb 25, 2025
ab592e7
add nuisance loss for ext pred in did_binary
SvenKlaassen Feb 25, 2025
c126999
update external predictions for did
SvenKlaassen Feb 25, 2025
93b90ba
update ml_m target for did
SvenKlaassen Feb 25, 2025
68b1cad
add nuisance loss test for did_binary
SvenKlaassen Feb 25, 2025
0dd3999
reformat
SvenKlaassen Feb 25, 2025
5f2ad93
extend exception tests
SvenKlaassen Feb 26, 2025
170efcd
add check for not yet treated
SvenKlaassen Feb 26, 2025
bd68fc5
remove import
SvenKlaassen Feb 26, 2025
481e2b8
adapt docsting did_binary
SvenKlaassen Feb 26, 2025
ceb2723
add scaling for sensitivity elements did binary
SvenKlaassen Feb 26, 2025
22383c1
reformat
SvenKlaassen Feb 26, 2025
75f82ed
fix docstring
SvenKlaassen Feb 26, 2025
fc58f24
First did multi class
SvenKlaassen Feb 26, 2025
675111f
add control group to DoubleMLDIDMulti
SvenKlaassen Feb 26, 2025
07d155c
add unit tests for set and get id positions
SvenKlaassen Feb 27, 2025
bea11ed
change to gt_combinations in multi
SvenKlaassen Feb 27, 2025
acab136
check score exception
SvenKlaassen Feb 27, 2025
c947052
add trimming checks
SvenKlaassen Feb 27, 2025
34100bb
add classifier check
SvenKlaassen Feb 27, 2025
67810e0
add model initialization do multiperiod did
SvenKlaassen Feb 27, 2025
59bcea7
fix import
SvenKlaassen Feb 27, 2025
4c2d5c6
first version of default tests and fit()
SvenKlaassen Feb 27, 2025
08b7538
add gt label property
SvenKlaassen Feb 28, 2025
d913eee
set sample splitting error to not implemented
SvenKlaassen Feb 28, 2025
cdef60a
fix typo
SvenKlaassen Feb 28, 2025
2879aba
add basic default tests
SvenKlaassen Feb 28, 2025
e75602d
extent properties did multi
SvenKlaassen Feb 28, 2025
1634c21
add default tests for did
SvenKlaassen Feb 28, 2025
a9ad220
remove sample splitting from defaults
SvenKlaassen Feb 28, 2025
1dc1f62
enable to set sample splitting for panel data in did binary
SvenKlaassen Feb 28, 2025
2d82375
enable ext pred for panel data did binary
SvenKlaassen Feb 28, 2025
6a650ba
extend binary ext_pred tests
SvenKlaassen Feb 28, 2025
89dd0e3
enable ext predictions for did multi
SvenKlaassen Feb 28, 2025
51e881b
reformat
SvenKlaassen Feb 28, 2025
63e6ea6
test did binary vs multi
SvenKlaassen Feb 28, 2025
1d626b0
add external prediction tests for did multi
SvenKlaassen Feb 28, 2025
2a6939c
extend default tests
SvenKlaassen Feb 28, 2025
d12016e
add p-value adjustment to did multi
SvenKlaassen Feb 28, 2025
f0781dd
remove pre treatment drop
SvenKlaassen Feb 28, 2025
1fe351d
add control group check to utils
SvenKlaassen Mar 3, 2025
9beade3
update checks on gt_values and combinations
SvenKlaassen Mar 3, 2025
b2fb9db
basic input validation for gt_combinations
SvenKlaassen Mar 3, 2025
01178f0
test combinations for strings and lists
SvenKlaassen Mar 3, 2025
a7179d7
further input checks
SvenKlaassen Mar 3, 2025
d7da748
add default options for gt_combinations
SvenKlaassen Mar 3, 2025
f9a3755
fix gt_combination constructions
SvenKlaassen Mar 3, 2025
79011e2
return type tests for did binary
SvenKlaassen Mar 3, 2025
e48c74a
extend tests
SvenKlaassen Mar 3, 2025
4a48b69
fix format
SvenKlaassen Mar 3, 2025
5522c6b
add gt_index
SvenKlaassen Mar 3, 2025
1da6c5b
add nuisance loss to did multi
SvenKlaassen Mar 3, 2025
d9ced28
fix loss did multi
SvenKlaassen Mar 3, 2025
c6771a4
fix summary did multi
SvenKlaassen Mar 3, 2025
f7a6ffd
placeholder for aggregation
SvenKlaassen Mar 3, 2025
36d8139
check fit for aggregate
SvenKlaassen Mar 3, 2025
41b8368
first aggregation
SvenKlaassen Mar 4, 2025
529fd2b
add overall to aggregation
SvenKlaassen Mar 4, 2025
6fa74d2
adjust test
SvenKlaassen Mar 4, 2025
80c9130
add basic aggregation test
SvenKlaassen Mar 4, 2025
e4728a9
fix agg name
SvenKlaassen Mar 4, 2025
e58750f
add weighting
SvenKlaassen Mar 4, 2025
ef2cb00
update overall aggregation
SvenKlaassen Mar 4, 2025
3195221
refactor aggregation
SvenKlaassen Mar 4, 2025
bac62a9
add weights_mask
SvenKlaassen Mar 4, 2025
e8f8f0e
create post_treatment mask
SvenKlaassen Mar 5, 2025
a181f7e
update weighted aggregation
SvenKlaassen Mar 5, 2025
4d4ec11
fix mask combination
SvenKlaassen Mar 5, 2025
bc684bd
add overall weight rescaling
SvenKlaassen Mar 5, 2025
805bf16
fix deprecated loss assingment
SvenKlaassen Mar 5, 2025
dc9a5c1
simplify aggregation
SvenKlaassen Mar 5, 2025
6a20a9b
update tests
SvenKlaassen Mar 5, 2025
82806ff
fix string
SvenKlaassen Mar 5, 2025
63409b2
move did_utils
SvenKlaassen Mar 5, 2025
03f75ef
add aggregation checks
SvenKlaassen Mar 5, 2025
f9313d4
rename check agg test
SvenKlaassen Mar 6, 2025
fd61c9b
test group agg weights
SvenKlaassen Mar 6, 2025
c64929c
update aggregation in did multi class
SvenKlaassen Mar 6, 2025
5e7b57a
add aggregation obj
SvenKlaassen Mar 6, 2025
a24e9c8
fix format
SvenKlaassen Mar 6, 2025
4d15827
fix aggregation str
SvenKlaassen Mar 6, 2025
50b7223
fix summary aggregation
SvenKlaassen Mar 6, 2025
d4bc956
add additional info to aggregation object
SvenKlaassen Mar 6, 2025
41fa2b3
fix aggregation method add info input
SvenKlaassen Mar 6, 2025
86bd1b9
fix overall summary
SvenKlaassen Mar 6, 2025
d22841f
fix typo
SvenKlaassen Mar 6, 2025
29a7037
add simple input checks to did aggregation
SvenKlaassen Mar 6, 2025
572ae85
refactor DoubleMLAggregation
SvenKlaassen Mar 6, 2025
541aef4
update did_multi aggregation
SvenKlaassen Mar 6, 2025
5bc1e56
update did aggregation utils
SvenKlaassen Mar 6, 2025
72bc2a9
fix format
SvenKlaassen Mar 6, 2025
b3172bd
update aggregation method name
SvenKlaassen Mar 6, 2025
406267d
add weight masks as additional parameters to did aggregation
SvenKlaassen Mar 6, 2025
36d7b4f
add default and return type tests for aggregation
SvenKlaassen Mar 6, 2025
6e6cfaf
add aggregation tests
SvenKlaassen Mar 6, 2025
acab0da
add scaling comment to framework
SvenKlaassen Mar 6, 2025
ac5c354
add _str()_ test to did aggregation
SvenKlaassen Mar 6, 2025
575454c
check external prediction exceptions
SvenKlaassen Mar 6, 2025
f22cb6a
remove missing irm_data
SvenKlaassen Mar 7, 2025
d700448
add exception tests for panel data
SvenKlaassen Mar 7, 2025
072007e
add str() unit test for panel data and fix format
SvenKlaassen Mar 7, 2025
7425ffa
add property tests for panel data
SvenKlaassen Mar 7, 2025
65baaaa
add test for cluster data str
SvenKlaassen Mar 7, 2025
960421f
add str test to model defaults
SvenKlaassen Mar 7, 2025
8f35f6d
set n_obs for panel data to effective sample size for resampling
SvenKlaassen Mar 7, 2025
3b94b77
add did_multi return type tests
SvenKlaassen Mar 7, 2025
bad1aa8
set p_adjust value names
SvenKlaassen Mar 7, 2025
5223688
add placebo test for did_multi and formatting
SvenKlaassen Mar 7, 2025
95f15ec
update binary placebo test
SvenKlaassen Mar 7, 2025
e10caab
remove smpls property from did multi
SvenKlaassen Mar 7, 2025
a7305ad
add exception test for methods before fit()
SvenKlaassen Mar 7, 2025
5da841b
add unit tests for sensitivity benchmark
SvenKlaassen Mar 7, 2025
f212a3b
add data exception tests for did binary
SvenKlaassen Mar 7, 2025
2a356ca
formatting
SvenKlaassen Mar 7, 2025
fca7fe9
remove data check todo
SvenKlaassen Mar 7, 2025
fc1e655
add property types tests for did binary
SvenKlaassen Mar 7, 2025
2d6423d
add unit tests for did binary inputs
SvenKlaassen Mar 7, 2025
af9f515
test stdout for did binary
SvenKlaassen Mar 7, 2025
250839d
first verstion of time effects
SvenKlaassen Mar 7, 2025
8eb13a6
add todo tests
SvenKlaassen Mar 7, 2025
0f612b4
update time weights
SvenKlaassen Mar 7, 2025
d922ea7
fix time aggregation scaling
SvenKlaassen Mar 7, 2025
11e8cfa
normalize weight masks data
SvenKlaassen Mar 7, 2025
2a285ab
Move group weights outsite of time loop
SvenKlaassen Mar 7, 2025
7feb280
add tests for time aggregation
SvenKlaassen Mar 7, 2025
33aa4c1
fix summary test
SvenKlaassen Mar 10, 2025
e13b33a
renameing id in panel data
SvenKlaassen Mar 10, 2025
06e73a2
first version of eventstudy weights
SvenKlaassen Mar 10, 2025
2e4f4e0
fix eventstudy check
SvenKlaassen Mar 10, 2025
9cd138b
fix sign eventstudy
SvenKlaassen Mar 10, 2025
2e75c85
add checks for event study weights
SvenKlaassen Mar 10, 2025
3a13428
fix format
SvenKlaassen Mar 10, 2025
abe8c2d
update event_study aggregation weights
SvenKlaassen Mar 10, 2025
b585f19
add options to panel data for time unit
SvenKlaassen Mar 10, 2025
233a30f
extend summary with control group etc
SvenKlaassen Mar 10, 2025
5a44af6
add effective sample size to summary
SvenKlaassen Mar 10, 2025
a7df4da
remove gt combinations from str
SvenKlaassen Mar 10, 2025
85be2d3
add linebread
SvenKlaassen Mar 10, 2025
5e76401
make check for ml_m dependent on score
SvenKlaassen Mar 17, 2025
527e4af
add warning for learner ml_m with experimental score
SvenKlaassen Mar 17, 2025
43dd382
add seaborn dependence
SvenKlaassen Mar 17, 2025
5f03967
add aggregation plot
SvenKlaassen Mar 17, 2025
639feda
add aggregation to init
SvenKlaassen Mar 17, 2025
cd00a63
add warning for joint CI and bootstrapping
SvenKlaassen Mar 17, 2025
72bcf36
add tests for did multi plots
SvenKlaassen Mar 17, 2025
fde29d9
fix color idx
SvenKlaassen Mar 17, 2025
bceaa0c
improve boostrap warning
SvenKlaassen Mar 17, 2025
eaad631
add df to did multi
SvenKlaassen Mar 17, 2025
00ba32c
create add_jitter function
SvenKlaassen Mar 17, 2025
93bd997
basic plot function for did multi
SvenKlaassen Mar 17, 2025
67b4c40
add basic plot tests
SvenKlaassen Mar 17, 2025
9e77bf3
fix tests and warnings in plot_effects
SvenKlaassen Mar 17, 2025
bda195c
update plot ci sizes
SvenKlaassen Mar 17, 2025
810729c
fix docstring and sorting
SvenKlaassen Mar 17, 2025
a484456
fix effect plot first treated
SvenKlaassen Mar 17, 2025
abaf698
update jitter in plot
SvenKlaassen Mar 17, 2025
3299b30
fix jitter time delta
SvenKlaassen Mar 17, 2025
456b25d
remove not assigned variables
SvenKlaassen Mar 17, 2025
fb5b869
formatting
SvenKlaassen Apr 3, 2025
7eb7dfe
adjust unused vars
SvenKlaassen Apr 3, 2025
8b3b8dd
Merge branch 'main' into did-extension
SvenKlaassen Apr 7, 2025
e97df91
change never treated value to np.inf
SvenKlaassen Apr 10, 2025
988f80e
remove unused variable
SvenKlaassen Apr 10, 2025
7164196
add test for anticipation period parameter
SvenKlaassen Apr 14, 2025
fe67bf9
add anticipation property
SvenKlaassen Apr 14, 2025
c104302
Merge branch 'did-extension' of https://github.com/DoubleML/doubleml-…
SvenKlaassen Apr 14, 2025
a3ac286
check default properties for control group and anticipation
SvenKlaassen Apr 14, 2025
f35ac3f
add test if control group is not empty
SvenKlaassen Apr 14, 2025
3444fe1
add check if g_value in t_values
SvenKlaassen Apr 14, 2025
9a3cbbf
add warning for anticipation issues to did binary
SvenKlaassen Apr 14, 2025
a8b00a1
adjust not yet treated for anticipation
SvenKlaassen Apr 14, 2025
c357458
pass anticipation argument
SvenKlaassen Apr 14, 2025
05e79a1
fix preprocess_data
SvenKlaassen Apr 14, 2025
d79c834
add never_treated_value property
SvenKlaassen Apr 14, 2025
0ce9c32
fix not yet treated adjustment
SvenKlaassen Apr 14, 2025
7701f77
add anticipations into constructed combinations
SvenKlaassen Apr 14, 2025
e83af16
add anticipation periods to summary
SvenKlaassen Apr 14, 2025
e65a6d8
add anticipation to summaries
SvenKlaassen Apr 14, 2025
4b6086c
add optional anticipation to CS2021
SvenKlaassen Apr 14, 2025
a9aa3e9
filter anticipation dataset
SvenKlaassen Apr 14, 2025
edaa645
enable never treated
SvenKlaassen Apr 14, 2025
0343af2
fix never_treated anticipation CS2021
SvenKlaassen Apr 14, 2025
872616b
fix not_yet treated control group
SvenKlaassen Apr 14, 2025
ef1dc11
fix not yet treated
SvenKlaassen Apr 14, 2025
1e91f12
axes for case with one treatment period
PhilippBach Apr 16, 2025
6990105
update benchmark docstring
SvenKlaassen Apr 22, 2025
1742cb0
add warning for experimental benchmark
SvenKlaassen Apr 22, 2025
d5b4d9c
add warnings for did benchmark and experimental score
SvenKlaassen Apr 22, 2025
69b6bc7
update docstring
SvenKlaassen Apr 22, 2025
796a787
update docstring
SvenKlaassen Apr 22, 2025
3e0d214
fix docstring
SvenKlaassen Apr 22, 2025
cea50d4
update docstring
SvenKlaassen Apr 22, 2025
97bfd6f
remove example
SvenKlaassen Apr 22, 2025
4341755
add docstring
SvenKlaassen Apr 22, 2025
9152d75
remove todo
SvenKlaassen Apr 22, 2025
d49459d
Merge pull request #315 from DoubleML/p-fix-plot
SvenKlaassen Apr 22, 2025
eee49b8
add docstring
SvenKlaassen Apr 23, 2025
b1fac65
add irm classes
SvenKlaassen Apr 23, 2025
e756a12
fix docstring issues
SvenKlaassen Apr 23, 2025
1867e52
update docstring formula
SvenKlaassen Apr 23, 2025
419dbcd
formatting
SvenKlaassen Apr 23, 2025
50af0e6
fix docstring
SvenKlaassen Apr 23, 2025
3fbd991
fix docstring
SvenKlaassen Apr 23, 2025
f67b1fb
update docstring formulas
SvenKlaassen Apr 23, 2025
b15318b
update kwargs docstring
SvenKlaassen Apr 23, 2025
f26cbaf
remove math header
SvenKlaassen Apr 23, 2025
5a1e3cb
update kwargs description
SvenKlaassen Apr 23, 2025
e50836e
kwargs structure
SvenKlaassen Apr 23, 2025
cf7b185
remove warning test for observational score
SvenKlaassen Apr 24, 2025
78463ec
fix docstring alignments
SvenKlaassen Apr 24, 2025
2 changes: 1 addition & 1 deletion doubleml/__init__.py
@@ -1,8 +1,8 @@
 import importlib.metadata
 
+from .data import DoubleMLClusterData, DoubleMLData
 from .did.did import DoubleMLDID
 from .did.did_cs import DoubleMLDIDCS
-from .double_ml_data import DoubleMLClusterData, DoubleMLData
 from .double_ml_framework import DoubleMLFramework, concat
 from .irm.apo import DoubleMLAPO
 from .irm.apos import DoubleMLAPOS
13 changes: 13 additions & 0 deletions doubleml/data/__init__.py
@@ -0,0 +1,13 @@
"""
The :mod:`doubleml.data` module implements data classes for double machine learning.
"""

from .base_data import DoubleMLData
from .cluster_data import DoubleMLClusterData
from .panel_data import DoubleMLPanelData

__all__ = [
    "DoubleMLData",
    "DoubleMLClusterData",
    "DoubleMLPanelData",
]
470 changes: 55 additions & 415 deletions doubleml/double_ml_data.py → doubleml/data/base_data.py

Large diffs are not rendered by default.

289 changes: 289 additions & 0 deletions doubleml/data/cluster_data.py
@@ -0,0 +1,289 @@
import io

import numpy as np
import pandas as pd
from sklearn.utils import assert_all_finite
from sklearn.utils.validation import check_array

from doubleml.data.base_data import DoubleMLBaseData, DoubleMLData
from doubleml.utils._estimation import _assure_2d_array


class DoubleMLClusterData(DoubleMLData):
    """Double machine learning data-backend for data with cluster variables.

    :class:`DoubleMLClusterData` objects can be initialized from
    :class:`pandas.DataFrame`'s as well as :class:`numpy.ndarray`'s.

    Parameters
    ----------
    data : :class:`pandas.DataFrame`
        The data.

    y_col : str
        The outcome variable.

    d_cols : str or list
        The treatment variable(s).

    cluster_cols : str or list
        The cluster variable(s).

    x_cols : None, str or list
        The covariates.
        If ``None``, all variables (columns of ``data``) which are neither specified as outcome variable ``y_col``, nor
        treatment variables ``d_cols``, nor instrumental variables ``z_cols`` are used as covariates.
        Default is ``None``.

    z_cols : None, str or list
        The instrumental variable(s).
        Default is ``None``.

    t_col : None or str
        The time variable (only relevant/used for DiD Estimators).
        Default is ``None``.

    s_col : None or str
        The score or selection variable (only relevant/used for RDD and SSM Estimators).
        Default is ``None``.

    use_other_treat_as_covariate : bool
        Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates.
        Default is ``True``.

    force_all_x_finite : bool or str
        Indicates whether to raise an error on infinite values and / or missings in the covariates ``x``.
        Possible values are: ``True`` (neither missings ``np.nan``, ``pd.NA`` nor infinite values ``np.inf`` are
        allowed), ``False`` (missings and infinite values are allowed), ``'allow-nan'`` (only missings are allowed).
        Note that the choices ``False`` and ``'allow-nan'`` are only reasonable if the machine learning methods used
        for the nuisance functions are capable of providing valid predictions with missings and / or infinite values
        in the covariates ``x``.
        Default is ``True``.

    Examples
    --------
    >>> from doubleml import DoubleMLClusterData
    >>> from doubleml.datasets import make_pliv_multiway_cluster_CKMS2021
    >>> # initialization from pandas.DataFrame
    >>> df = make_pliv_multiway_cluster_CKMS2021(return_type='DataFrame')
    >>> obj_dml_data_from_df = DoubleMLClusterData(df, 'Y', 'D', ['cluster_var_i', 'cluster_var_j'], z_cols='Z')
    >>> # initialization from np.ndarray
    >>> (x, y, d, cluster_vars, z) = make_pliv_multiway_cluster_CKMS2021(return_type='array')
    >>> obj_dml_data_from_array = DoubleMLClusterData.from_arrays(x, y, d, cluster_vars, z)
    """

    def __init__(
        self,
        data,
        y_col,
        d_cols,
        cluster_cols,
        x_cols=None,
        z_cols=None,
        t_col=None,
        s_col=None,
        use_other_treat_as_covariate=True,
        force_all_x_finite=True,
    ):
        DoubleMLBaseData.__init__(self, data)

        # we need to set cluster_cols (needs _data) before call to the super __init__ because of the x_cols setter
        self.cluster_cols = cluster_cols
        self._set_cluster_vars()
        DoubleMLData.__init__(
            self, data, y_col, d_cols, x_cols, z_cols, t_col, s_col, use_other_treat_as_covariate, force_all_x_finite
        )
        self._check_disjoint_sets_cluster_cols()

    def __str__(self):
        data_summary = self._data_summary_str()
        buf = io.StringIO()
        self.data.info(verbose=False, buf=buf)
        df_info = buf.getvalue()
        res = (
            "================== DoubleMLClusterData Object ==================\n"
            + "\n------------------ Data summary ------------------\n"
            + data_summary
            + "\n------------------ DataFrame info ------------------\n"
            + df_info
        )
        return res

    def _data_summary_str(self):
        data_summary = (
            f"Outcome variable: {self.y_col}\n"
            f"Treatment variable(s): {self.d_cols}\n"
            f"Cluster variable(s): {self.cluster_cols}\n"
            f"Covariates: {self.x_cols}\n"
            f"Instrument variable(s): {self.z_cols}\n"
        )
        if self.t_col is not None:
            data_summary += f"Time variable: {self.t_col}\n"
        if self.s_col is not None:
            data_summary += f"Score/Selection variable: {self.s_col}\n"

        data_summary += f"No. Observations: {self.n_obs}\n"
        return data_summary

    @classmethod
    def from_arrays(
        cls, x, y, d, cluster_vars, z=None, t=None, s=None, use_other_treat_as_covariate=True, force_all_x_finite=True
    ):
        """
        Initialize :class:`DoubleMLClusterData` from :class:`numpy.ndarray`'s.

        Parameters
        ----------
        x : :class:`numpy.ndarray`
            Array of covariates.

        y : :class:`numpy.ndarray`
            Array of the outcome variable.

        d : :class:`numpy.ndarray`
            Array of treatment variables.

        cluster_vars : :class:`numpy.ndarray`
            Array of cluster variables.

        z : None or :class:`numpy.ndarray`
            Array of instrumental variables.
            Default is ``None``.

        t : None or :class:`numpy.ndarray`
            Array of the time variable (only relevant/used for DiD models).
            Default is ``None``.

        s : None or :class:`numpy.ndarray`
            Array of the score or selection variable (only relevant/used for RDD or SSM models).
            Default is ``None``.

        use_other_treat_as_covariate : bool
            Indicates whether in the multiple-treatment case the other treatment variables should be added as covariates.
            Default is ``True``.

        force_all_x_finite : bool or str
            Indicates whether to raise an error on infinite values and / or missings in the covariates ``x``.
            Possible values are: ``True`` (neither missings ``np.nan``, ``pd.NA`` nor infinite values ``np.inf`` are
            allowed), ``False`` (missings and infinite values are allowed), ``'allow-nan'`` (only missings are allowed).
            Note that the choices ``False`` and ``'allow-nan'`` are only reasonable if the machine learning methods used
            for the nuisance functions are capable of providing valid predictions with missings and / or infinite values
            in the covariates ``x``.
            Default is ``True``.

        Examples
        --------
        >>> from doubleml import DoubleMLClusterData
        >>> from doubleml.datasets import make_pliv_multiway_cluster_CKMS2021
        >>> (x, y, d, cluster_vars, z) = make_pliv_multiway_cluster_CKMS2021(return_type='array')
        >>> obj_dml_data_from_array = DoubleMLClusterData.from_arrays(x, y, d, cluster_vars, z)
        """
        dml_data = DoubleMLData.from_arrays(x, y, d, z, t, s, use_other_treat_as_covariate, force_all_x_finite)
        cluster_vars = check_array(cluster_vars, ensure_2d=False, allow_nd=False)
        cluster_vars = _assure_2d_array(cluster_vars)
        if cluster_vars.shape[1] == 1:
            cluster_cols = ["cluster_var"]
        else:
            cluster_cols = [f"cluster_var{i + 1}" for i in np.arange(cluster_vars.shape[1])]

        data = pd.concat((pd.DataFrame(cluster_vars, columns=cluster_cols), dml_data.data), axis=1)

        return cls(
            data,
            dml_data.y_col,
            dml_data.d_cols,
            cluster_cols,
            dml_data.x_cols,
            dml_data.z_cols,
            dml_data.t_col,
            dml_data.s_col,
            dml_data.use_other_treat_as_covariate,
            dml_data.force_all_x_finite,
        )

    @property
    def cluster_cols(self):
        """
        The cluster variable(s).
        """
        return self._cluster_cols

    @cluster_cols.setter
    def cluster_cols(self, value):
        reset_value = hasattr(self, "_cluster_cols")
        if isinstance(value, str):
            value = [value]
        if not isinstance(value, list):
            raise TypeError(
                "The cluster variable(s) cluster_cols must be of str or list type. "
                f"{str(value)} of type {str(type(value))} was passed."
            )
        if not len(set(value)) == len(value):
            raise ValueError("Invalid cluster variable(s) cluster_cols: Contains duplicate values.")
        if not set(value).issubset(set(self.all_variables)):
            raise ValueError("Invalid cluster variable(s) cluster_cols. At least one cluster variable is no data column.")
        self._cluster_cols = value
        if reset_value:
            self._check_disjoint_sets()
            self._set_cluster_vars()

    @property
    def n_cluster_vars(self):
        """
        The number of cluster variables.
        """
        return len(self.cluster_cols)

    @property
    def cluster_vars(self):
        """
        Array of cluster variable(s).
        """
        return self._cluster_vars.values

    def _get_optional_col_sets(self):
        base_optional_col_sets = super()._get_optional_col_sets()
        cluster_cols_set = set(self.cluster_cols)
        return [cluster_cols_set] + base_optional_col_sets

    def _check_disjoint_sets(self):
        # apply the standard checks from the DoubleMLData class
        super(DoubleMLClusterData, self)._check_disjoint_sets()
        self._check_disjoint_sets_cluster_cols()

    def _check_disjoint_sets_cluster_cols(self):
        # apply the standard checks from the DoubleMLData class
        super(DoubleMLClusterData, self)._check_disjoint_sets()

        # special checks for the additional cluster variables
        cluster_cols_set = set(self.cluster_cols)
        y_col_set = {self.y_col}
        x_cols_set = set(self.x_cols)
        d_cols_set = set(self.d_cols)

        z_cols_set = set(self.z_cols or [])
        t_col_set = {self.t_col} if self.t_col else set()
        s_col_set = {self.s_col} if self.s_col else set()

        # TODO: X can not be used as cluster variable
        cluster_checks_args = [
            (y_col_set, "outcome variable", "``y_col``"),
            (d_cols_set, "treatment variable", "``d_cols``"),
            (x_cols_set, "covariate", "``x_cols``"),
            (z_cols_set, "instrumental variable", "``z_cols``"),
            (t_col_set, "time variable", "``t_col``"),
            (s_col_set, "score or selection variable", "``s_col``"),
        ]
        for set1, name, argument in cluster_checks_args:
            self._check_disjoint(
                set1=set1,
                name1=name,
                arg1=argument,
                set2=cluster_cols_set,
                name2="cluster variable(s)",
                arg2="``cluster_cols``",
            )

    def _set_cluster_vars(self):
        assert_all_finite(self.data.loc[:, self.cluster_cols])
        self._cluster_vars = self.data.loc[:, self.cluster_cols]
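
Note on the `cluster_cols` validation in `cluster_data.py`: the setter and `_check_disjoint_sets_cluster_cols` together normalize a string to a list, reject non-list input, duplicates and unknown columns, and then require the cluster columns to be disjoint from every other column role. The standalone sketch below mirrors that flow without importing `doubleml`. The function name `normalize_cluster_cols`, its arguments, and the wording of the overlap error are assumptions made for illustration; the real overlap message comes from `_check_disjoint`, which is not shown in this diff.

```python
def normalize_cluster_cols(value, all_columns, other_roles):
    """Hypothetical stand-in for the cluster_cols setter plus the disjointness check."""
    # A single column name is wrapped into a list, as in the setter.
    if isinstance(value, str):
        value = [value]
    if not isinstance(value, list):
        raise TypeError(
            "The cluster variable(s) cluster_cols must be of str or list type. "
            f"{value} of type {type(value)} was passed."
        )
    if len(set(value)) != len(value):
        raise ValueError("Invalid cluster variable(s) cluster_cols: Contains duplicate values.")
    if not set(value).issubset(all_columns):
        raise ValueError("Invalid cluster variable(s) cluster_cols. At least one cluster variable is no data column.")
    # Assumed message format: the real text is produced by _check_disjoint.
    for role, cols in other_roles.items():
        overlap = set(value) & set(cols)
        if overlap:
            raise ValueError(f"{sorted(overlap)} cannot be used as both cluster variable(s) and {role}.")
    return value


columns = ["y", "d", "x1", "c1", "c2"]
roles = {"outcome variable": ["y"], "treatment variable": ["d"], "covariate": ["x1"]}
print(normalize_cluster_cols("c1", columns, roles))  # prints ['c1']
```

A tuple such as `("c1",)` raises `TypeError` here, matching the setter's `isinstance(value, list)` check, and reusing a covariate such as `"x1"` as a cluster column raises `ValueError`.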
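
Note on column naming in `DoubleMLClusterData.from_arrays`: the cluster array has no column names, so the method generates them, using `cluster_var` for a single cluster column and `cluster_var1`, `cluster_var2`, ... otherwise. The helper below reproduces only that naming rule in plain Python. The name `default_cluster_col_names` is hypothetical and does not exist in the package.

```python
def default_cluster_col_names(n_cluster_vars):
    """Hypothetical helper mirroring the naming rule used in from_arrays."""
    if n_cluster_vars == 1:
        # A single cluster variable gets no numeric suffix.
        return ["cluster_var"]
    # Several cluster variables are numbered starting at 1.
    return [f"cluster_var{i + 1}" for i in range(n_cluster_vars)]


print(default_cluster_col_names(1))  # prints ['cluster_var']
print(default_cluster_col_names(2))  # prints ['cluster_var1', 'cluster_var2']
```

In `from_arrays` these generated columns are placed in front of the columns of the intermediate `DoubleMLData` frame via `pd.concat(..., axis=1)`.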