API documentation
ECH class
- class pyech.core.ECH(dirpath=None, categorical_threshold=50, splitter=None)
Bases: object
Download, read and process the 2006-2020 Encuesta Continua de Hogares survey carried out by Uruguay’s Instituto Nacional de Estadística.
- Parameters
dirpath (Union[Path, str, None]) – Path where to download new surveys or read existing ones, by default “.”.
categorical_threshold (int) – Number of unique values below which the variable is considered categorical, by default 50.
splitter (Optional[str]) –
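The role of categorical_threshold can be illustrated with a minimal sketch in plain Python. The helper name is_categorical and the strict "fewer than" comparison are assumptions made for illustration; pyech's internal check may differ.

```python
def is_categorical(values, categorical_threshold=50):
    """Treat a variable as categorical when it has fewer unique values than the threshold."""
    # Assumption: "below" is read as a strict comparison.
    return len(set(values)) < categorical_threshold


sex = [1, 2, 2, 1, 2]              # two distinct codes
income = list(range(1000, 1100))   # 100 distinct amounts

print(is_categorical(sex))     # True
print(is_categorical(income))  # False
```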
- data
Survey data, by default pd.DataFrame().
- Type
pd.DataFrame
- metadata
Survey metadata, by default None.
- Type
metadata_container
- weights
Column in data used to weight cases. Generally “pesoano” for annual weighting, by default None.
- Type
Optional[str]
- splitter
Variable(s) to use for grouping in methods (summarize, assign_ptile), by default [].
- Type
Union[str, List[str]]
- dictionary
Variable dictionary, by default pd.DataFrame().
- Type
pd.DataFrame
- cpi
Monthly CPI data, by default pd.DataFrame().
- Type
pd.DataFrame
- nxr
Monthly nominal exchange rate data, by default pd.DataFrame().
- Type
pd.DataFrame
- classmethod from_data(data, metadata, splitter=None, weights=None)
Build ECH from data and metadata as created by pyreadstat.read_sav().
- Parameters
data (DataFrame) – Survey data.
metadata (metadata_container) – Survey metadata.
weights (Optional[str]) – Column in data used to weight cases. Generally “pesoano” for annual weighting, by default None.
splitter (Union[str, List[str], None]) – Variable(s) to use for grouping in methods (summarize, assign_ptile), by default [].
- Returns
- Return type
- property splitter
- property year: int
- Return type
int
- property weights: Optional[str]
- Return type
Optional[str]
- load(year, from_repo=False, weights=None, splitter=None, missing='\\s+\\.', missing_regex=True, lower=True, multiprocess=False)
Load an ECH survey and dictionary from a specified year.
First attempt to read a survey by looking for “year.sav” in dirpath. If it cannot be found, download the .rar file, extract it to a temporary directory, move the renamed .sav file to dirpath and then read it. Optionally replace missing values with numpy.nan, lower all variable names and download the corresponding variable dictionary.
For the 2020 survey a new column called “pesoano” is calculated according to the following formula: pesoano = pesomen / 12. The result is rounded and converted to int. This is because 2020 is the first survey that does not have annual weights (“pesoano”). However, they can be calculated from monthly weights (“pesomen”).
- Parameters
year (int) – Survey year.
from_repo (bool) – If True, download the survey from the GitHub repo as an HDFS+JSON combo.
weights (Optional[str]) – Variable used for weighting cases, by default None.
splitter (Union[str, List[str], None]) – Variable(s) to use for grouping in methods (summarize, assign_ptile), by default [].
missing (Optional[str]) – Missing values to replace with numpy.nan. Can be a regex with missing_regex=True, by default r"\s+\.".
missing_regex (bool) – Whether to parse missing as regex, by default True.
lower (bool) – Whether to turn variable names to lower case. This helps with analyzing surveys for several years, by default True.
multiprocess (bool) – Whether to use multiprocessing to read the file. It will use all available CPUs, by default False.
- Return type
None
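Two details of load, the default missing pattern and the 2020 annual weight formula, can be sketched in plain Python. The helper names replace_missing and annual_weight are hypothetical, and treating a match anywhere in a cell as "missing" is an assumption; pyech itself works on pandas objects and may handle edge cases differently.

```python
import math
import re

# Default pattern: one or more whitespace characters followed by a literal dot.
MISSING = re.compile(r"\s+\.")


def replace_missing(cell):
    """Return NaN for string cells that match the missing-value pattern."""
    # Assumption: a match anywhere in the cell marks the whole cell as missing.
    if isinstance(cell, str) and MISSING.search(cell):
        return math.nan
    return cell


def annual_weight(pesomen):
    """pesoano = pesomen / 12, rounded and converted to int."""
    return int(round(pesomen / 12))


cleaned = [replace_missing(c) for c in ["   .", "Montevideo", 42]]
weights = [annual_weight(w) for w in [1200, 125, 70]]
```

Only the first cell is replaced, since it is the only one containing whitespace followed by a dot; the monthly weights are scaled down to one twelfth of their value.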
- static download(dirpath, year)
Download an ECH survey, unpack the .rar, extract the .sav, rename it as “year.sav” and place it in dirpath.
- Parameters
dirpath (Union[Path, str]) – Download location.
year (int) – Survey year.
- Return type
None
- get_dictionary(year)
Download and process variable dictionary for a specified year.
- Parameters
year (int) – Survey year.
- Return type
None
- search_dictionary(term, ignore_case=True, regex=True)
Return rows in dictionary with matching terms.
- Parameters
term (str) – Search term.
ignore_case (bool) – Whether to search for upper and lower case. Requires regex=True, by default True.
regex (bool) – Whether to parse term as regex, by default True.
- Returns
DataFrame containing matching rows.
- Return type
pd.DataFrame
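How ignore_case and regex interact can be sketched without pandas. The function search_rows and the sample rows are hypothetical stand-ins for the dictionary DataFrame; the sketch only mirrors the documented behaviour that case-insensitive matching requires regex=True.

```python
import re


def search_rows(rows, term, ignore_case=True, regex=True):
    """Return the rows in which any field matches the term."""
    if regex:
        flags = re.IGNORECASE if ignore_case else 0
        pattern = re.compile(term, flags)

        def matches(text):
            return pattern.search(text) is not None
    else:
        # Plain substring search; ignore_case has no effect here.
        def matches(text):
            return term in text

    return [row for row in rows if any(matches(str(field)) for field in row)]


# Hypothetical dictionary rows: (variable name, description).
rows = [
    ("e27", "Edad"),
    ("e26", "Sexo"),
    ("ht11", "Ingreso total del hogar"),
]

print(search_rows(rows, "edad"))               # [('e27', 'Edad')]
print(search_rows(rows, "edad", regex=False))  # []
```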
- summarize(variable, by=None, is_categorical=None, aggfunc='mean', household_level=False, prequery=None, variable_labels=False, value_labels=True, dropna=False)
Summarize a variable in data.
- Parameters
variable (str) – Variable to summarize.
by (Union[str, List[str], None]) – Summarize by these groups, as well as those in splitter, by default None.
is_categorical (Optional[bool]) – Whether variable should be treated as categorical. If None, compare with categorical_threshold, by default None.
aggfunc (Union[str, Callable]) – Aggregating function. Possible values are “mean”, “sum”, “count”, or any function that works with pd.DataFrame.apply. If variable is categorical, aggfunc=“count” is forced, by default “mean”.
prequery (Optional[str]) – Pass a string representing a boolean expression to query the survey before summarizing. For example, ‘e27 >= 18’ would filter out observations where the “e27” variable (age) is lower than 18, and then carry on with summarization. Leverages pandas’ query.
household_level (bool) – If True, summarize at the household level (i.e. consider only data[“nper”] == 1), by default False.
variable_labels (bool) – Whether to use variable labels from metadata, by default False.
value_labels (bool) – Whether to use value labels from metadata, by default True.
dropna (bool) – Whether to drop groups with no observations, by default False.
- Returns
Summarized variable.
- Return type
pd.DataFrame
- Raises
AttributeError – If weights is not defined.
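Since summarize requires weights, a grouped "mean" is presumably a weighted mean. The sketch below shows that computation in plain Python, with a list comprehension standing in for prequery. The function weighted_mean_by, the column names region and income, and all figures are made up for illustration; this is not pyech's implementation.

```python
from collections import defaultdict


def weighted_mean_by(rows, variable, by, weights):
    """Weighted mean of `variable` for each value of `by`."""
    totals = defaultdict(float)       # sum of value * weight per group
    weight_sums = defaultdict(float)  # sum of weights per group
    for row in rows:
        totals[row[by]] += row[variable] * row[weights]
        weight_sums[row[by]] += row[weights]
    return {group: totals[group] / weight_sums[group] for group in totals}


rows = [
    {"region": "north", "e27": 30, "income": 100.0, "pesoano": 10},
    {"region": "north", "e27": 45, "income": 200.0, "pesoano": 30},
    {"region": "south", "e27": 16, "income": 0.0, "pesoano": 20},
    {"region": "south", "e27": 50, "income": 300.0, "pesoano": 20},
]

# Same effect as prequery="e27 >= 18": drop under-18s before summarizing.
adults = [row for row in rows if row["e27"] >= 18]
result = weighted_mean_by(adults, "income", "region", "pesoano")
print(result)  # {'north': 175.0, 'south': 300.0}
```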
- assign_ptile(variable, n, labels=False, by=None, result_weighted=False, name=None, household_level=False)
Calculate n-tiles for a variable. By default, add them as a new column to data.
- Parameters
variable (str) – Reference variable.
n (int) – Number of bins to calculate.
labels (Union[bool, Sequence[str]]) – Passed to pandas.qcut. If False, use int labels for the resulting bins. If True, name bins by their edges. Otherwise pass a sequence of length equal to n, by default False.
by (Union[str, List[str], None]) – Calculate bins for each of the groups, as well as those in splitter, by default None.
result_weighted (bool) – If True, return a pd.DataFrame with the weighted result. Else, add as a column to data, by default False.
name (Optional[str]) – Name for the new column. If None, set as “variable_n”, by default None.
household_level (bool) – If True, calculate at the household level (i.e. consider only data[“nper”] == 1), by default False.
- Returns
- Return type
Optional[pd.DataFrame]
- Raises
AttributeError – If weights is not defined.
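One way to think about weighted n-tiles is to sort observations by value and cut where the cumulative weight crosses each 1/n share of the total. The sketch below follows that idea. It is an illustration only: the function weighted_ntile is hypothetical, it assumes positive integer weights, it labels bins from 1 to n, and it does not reproduce pandas.qcut or pyech's exact procedure.

```python
def weighted_ntile(values, weights, n):
    """Assign each observation to one of n bins holding roughly equal total weight."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = sum(weights)
    bins = [0] * len(values)
    cumulative = 0
    for i in order:
        cumulative += weights[i]
        # Ceiling of cumulative * n / total, using integer arithmetic only.
        bins[i] = (cumulative * n + total - 1) // total
    return bins


income = [400, 100, 300, 200]
print(weighted_ntile(income, [10, 10, 10, 10], 2))  # [2, 1, 2, 1]
print(weighted_ntile(income, [10, 30, 10, 10], 2))  # [2, 1, 2, 2]
```

In the second call the lowest income carries half of the total weight, so it fills the first bin on its own and every other observation falls in the second.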
- convert_real(variables, start=None, end=None)
Convert selected monetary variables to real terms.
- Parameters
variables (Union[str, List[str]]) – Column(s) in data. Can be a string or a sequence of strings for multiple columns.
start (Union[str, datetime, date, None]) – Set prices to either of these dates or the mean between them, by default None.
end (Union[str, datetime, date, None]) – Set prices to either of these dates or the mean between them, by default None.
- Return type
None
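The arithmetic behind a conversion to real terms can be sketched as rescaling each amount by the ratio of a base-period CPI to the CPI of the month in which it was observed. The function to_real, the month keys and the index values are hypothetical; how pyech picks the base when start and end are both None is not covered here.

```python
from statistics import mean


def to_real(amount, month, cpi, start, end=None):
    """Express a nominal amount observed in `month` at the prices of the base period."""
    if end is None:
        base = cpi[start]
    else:
        # ISO "YYYY-MM" strings compare chronologically.
        base = mean(v for m, v in cpi.items() if start <= m <= end)
    return amount * base / cpi[month]


cpi = {"2019-01": 100.0, "2019-02": 110.0, "2019-03": 120.0}  # made-up index values

print(to_real(1000.0, "2019-01", cpi, start="2019-03"))                 # 1200.0
print(to_real(1100.0, "2019-02", cpi, start="2019-01", end="2019-03"))  # 1100.0
```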
- convert_usd(variables)
Convert selected monetary variables to USD.
- Parameters
variables (Union[str, List[str]]) – Column(s) in data. Can be a string or a sequence of strings for multiple columns.
- Return type
None
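As a rough illustration, a conversion to USD divides each amount by the exchange rate of the month in which it was observed. The sketch assumes the rate is quoted as pesos per dollar and matched by month; the function to_usd and the rates are made up.

```python
def to_usd(amount, month, nxr):
    """Divide a peso amount by the exchange rate (pesos per USD) of its month."""
    return amount / nxr[month]


nxr = {"2019-01": 32.0, "2019-02": 33.0}  # made-up monthly rates

print(to_usd(6400.0, "2019-01", nxr))  # 200.0
print(to_usd(3300.0, "2019-02", nxr))  # 100.0
```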
- apply_weights(variables)
Repeat rows as many times as defined in weights.
- Parameters
variables (Union[str, List[str]]) – Columns for which weights should be applied. In general it is a good idea to avoid applying weights to all columns since this can result in a large DataFrame.
- Returns
- Return type
pd.DataFrame
- Raises
AttributeError – If weights is not defined.
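Repeating rows by weight makes unweighted statistics on the expanded data equal to weighted statistics on the original, at the cost of a much larger table. A minimal sketch in plain Python, with a hypothetical helper expand_by_weight and made-up figures:

```python
def expand_by_weight(rows, weights):
    """Repeat each row as many times as its weight."""
    return [row for row, w in zip(rows, weights) for _ in range(w)]


incomes = [100, 400]
weights = [3, 1]

expanded = expand_by_weight(incomes, weights)
plain_mean = sum(expanded) / len(expanded)
weighted_mean = sum(x * w for x, w in zip(incomes, weights)) / sum(weights)

print(expanded)                   # [100, 100, 100, 400]
print(plain_mean, weighted_mean)  # 175.0 175.0
```

With real survey weights each row can be repeated hundreds of times, which is why restricting the expansion to a few columns is recommended.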