API documentation

ECH class

class pyech.core.ECH(dirpath=None, categorical_threshold=50, splitter=None)

Bases: object

Download, read and process the 2006-2020 Encuesta Continua de Hogares survey carried out by Uruguay’s Instituto Nacional de Estadística.

Parameters

dirpath (Union[Path, str, None]) – Directory in which new surveys are downloaded and existing ones are read, by default “.”.
categorical_threshold (int) – Number of unique values below which the variable is considered categorical, by default 50.
splitter (Optional[str]) – Variable(s) to use for grouping in methods (summarize, assign_ptile), by default None.

data

Survey data, by default pd.DataFrame().

Type: pd.DataFrame

metadata

Survey metadata, by default None.

Type: metadata_container

weights

Column in data used to weight cases. Generally “pesoano” for annual weighting, by default None.

Type: Optional[str]

splitter

Variable(s) to use for grouping in methods (summarize, assign_ptile), by default [].

Type: Union[str, List[str]]

dictionary

Variable dictionary, by default pd.DataFrame().

Type: pd.DataFrame

cpi

Monthly CPI data, by default pd.DataFrame().

Type: pd.DataFrame

nxr

Monthly nominal exchange rate data, by default pd.DataFrame().

Type: pd.DataFrame
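
A minimal usage sketch of the class, assuming pyech is installed and the survey can be read from disk or downloaded. The function name and the "surveys" directory are hypothetical; the calls rely only on the signatures documented on this page, and the imports are deferred so the function can be defined without the package present.

```python
def summarize_adult_age(year: int = 2019, dirpath: str = "surveys"):
    """Load one survey and summarize age ("e27") for adults.

    Assumes pyech is installed; "surveys" is a hypothetical directory.
    """
    from pyech.core import ECH  # import path as documented on this page

    ech = ECH(dirpath=dirpath)
    # Annual weights are generally stored in the "pesoano" column.
    ech.load(year=year, weights="pesoano")
    # prequery keeps only observations where e27 (age) is at least 18.
    return ech.summarize("e27", prequery="e27 >= 18")
```

Calling `summarize_adult_age()` may trigger a download if “2019.sav” is not already in the directory.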

classmethod from_data(data, metadata, splitter=None, weights=None)

Build ECH from data and metadata as created by pyreadstat.read_sav().

Parameters

data (DataFrame) – Survey data.
metadata (metadata_container) – Survey metadata.
weights (Optional[str]) – Column in data used to weight cases. Generally “pesoano” for annual weighting, by default None.
splitter (Union[str, List[str], None]) – Variable(s) to use for grouping in methods (summarize, assign_ptile), by default [].

Return type

ECH
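
As a rough illustration of this constructor, the sketch below reads a .sav file that is already on disk and passes the result to from_data. It assumes pyreadstat and pyech are installed; the function name and sav_path are hypothetical, and the imports are deferred so the definition itself does not require either package.

```python
def ech_from_sav(sav_path: str):
    """Build an ECH object from a .sav file that is already on disk.

    Assumes pyreadstat and pyech are installed; sav_path is hypothetical.
    """
    import pyreadstat
    from pyech.core import ECH

    # read_sav returns a (DataFrame, metadata container) pair.
    data, metadata = pyreadstat.read_sav(sav_path)
    return ECH.from_data(data=data, metadata=metadata, weights="pesoano")
```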

property splitter

property year: int

Return type: int

property weights: Optional[str]

Return type: Optional[str]

load(year, from_repo=False, weights=None, splitter=None, missing='\\s+\\.', missing_regex=True, lower=True, multiprocess=False)

Load an ECH survey and dictionary from a specified year.

First attempt to read a survey by looking for “year.sav” in dirpath. If it cannot be found, download the .rar file, extract it to a temporary directory, move the renamed .sav file to dirpath and then read it. Optionally replace missing values with numpy.nan, convert all variable names to lower case and download the corresponding variable dictionary.

For the 2020 survey a new column called “pesoano” is calculated according to the following formula: pesoano = pesomen / 12. The result is rounded and converted to int. This is because 2020 is the first survey that does not have annual weights (“pesoano”). However, they can be calculated from monthly weights (“pesomen”).
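
The formula above can be sketched in plain Python. The weights below are made-up values, and the treatment of exact halves (Python's round uses round-half-to-even) is an assumption rather than something this page specifies.

```python
def annual_weight(pesomen: float) -> int:
    """Approximate an annual weight from a monthly one: round(pesomen / 12)."""
    return int(round(pesomen / 12))


monthly_weights = [1200, 1810, 95]  # made-up monthly weights ("pesomen")
annual_weights = [annual_weight(w) for w in monthly_weights]
print(annual_weights)
```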

Parameters

year (int) – Survey year
from_repo (bool) – If True, download the survey from the GitHub repo as an HDF5+JSON combo, by default False.
weights (Optional[str]) – Variable used for weighting cases, by default None.
splitter (Union[str, List[str], None]) – Variable(s) to use for grouping in methods (summarize, assign_ptile), by default [].
missing (Optional[str]) – Missing values to replace with numpy.nan. Treated as a regular expression when missing_regex=True, by default r"\s+\.".
missing_regex (bool) – Whether to parse missing as regex, by default True.
lower (bool) – Whether to turn variable names to lower case. This helps with analyzing surveys for several years, by default True.
multiprocess (bool) – Whether to use multiprocessing to read the file. It will use all available CPUs, by default False.

Return type

None
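
To show what the default missing pattern in the signature matches, here is a small sketch in plain Python. It assumes a string cell counts as missing when the pattern is found anywhere in it; how pyech applies the pattern internally is not stated on this page, and the sample values are made up.

```python
import math
import re

MISSING = re.compile(r"\s+\.")  # default pattern from the load() signature


def replace_missing(cell):
    """Return NaN for string cells that match the pattern, else the cell."""
    if isinstance(cell, str) and MISSING.search(cell):
        return math.nan
    return cell


raw = ["  .", "12", " .", 7, "."]  # made-up cell values
cleaned = [replace_missing(c) for c in raw]
```

A lone "." is left untouched because the pattern requires at least one whitespace character before the dot.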

static download(dirpath, year)

Download an ECH survey, unpack the .rar, extract the .sav, rename it as “year.sav” and place it in dirpath.

Parameters

dirpath (Union[Path, str]) – Download location.
year (int) – Survey year.

Return type

None

get_dictionary(year)

Download and process variable dictionary for a specified year.

Parameters: year (int) – Survey year.
Return type: None

search_dictionary(term, ignore_case=True, regex=True)

Return rows in dictionary with matching terms.

Parameters

term (str) – Search term.
ignore_case (bool) – Whether to ignore case when matching. Requires regex=True, by default True.
regex (bool) – Whether to parse term as regex, by default True.

Returns

DataFrame containing matching rows.

Return type

pd.DataFrame
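
The interaction between ignore_case and regex can be illustrated with a plain-Python sketch. This is not pyech's implementation: the helper name, the list-of-dicts layout and the "name"/"label" fields are hypothetical stand-ins for the dictionary DataFrame.

```python
import re


def search_rows(rows, term, ignore_case=True, regex=True):
    """Return the rows in which any field matches term."""
    if regex:
        pattern = re.compile(term, re.IGNORECASE if ignore_case else 0)

        def matches(text):
            return pattern.search(text) is not None
    else:
        def matches(text):
            # Literal, case-sensitive substring search; ignore_case is not applied.
            return term in text

    return [row for row in rows if any(matches(str(v)) for v in row.values())]


rows = [  # made-up dictionary entries
    {"name": "e26", "label": "Sexo"},
    {"name": "e27", "label": "Edad"},
    {"name": "pesoano", "label": "Peso anual"},
]
```

With these rows, `search_rows(rows, "edad")` finds the "e27" entry, while the same call with `regex=False` finds nothing because the literal search is case-sensitive.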

summarize(variable, by=None, is_categorical=None, aggfunc='mean', household_level=False, prequery=None, variable_labels=False, value_labels=True, dropna=False)

Summarize a variable in data.

Parameters

variable (str) – Variable to summarize.
by (Union[str, List[str], None]) – Summarize by these groups, as well as those in splitter, by default None.
is_categorical (Optional[bool]) – Whether variable should be treated as categorical. If None, compare its number of unique values with categorical_threshold, by default None.
aggfunc (Union[str, Callable]) – Aggregating function. Possible values are “mean”, “sum”, “count”, or any function that works with pd.DataFrame.apply. If variable is categorical, aggfunc is forced to “count”, by default “mean”.
prequery (Optional[str]) – Pass a string representing a boolean expression to query the survey before summarizing. For example, ‘e27 >= 18’ would filter out observations where the “e27” variable (age) is lower than 18, and then carry on with summarization. Leverages pandas’ query.
household_level (bool) – If True, summarize at the household level (i.e. consider only data["nper"] == 1), by default False.
variable_labels (bool) – Whether to use variable labels from metadata, by default False.
value_labels (bool) – Whether to use value labels from metadata, by default True.
dropna (bool) – Whether to drop groups with no observations, by default False.

Returns

Summarized variable.

Return type

pd.DataFrame

Raises

AttributeError – If weights is not defined.
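
Since summarize requires weights, it may help to see what a weighted group mean computes. The sketch below is a plain-Python illustration under made-up data, not pyech's internals; the helper name and the "region" and "income" columns are hypothetical.

```python
from collections import defaultdict


def weighted_mean_by(rows, variable, weights, by):
    """Weighted mean of `variable` within each group of `by`."""
    totals = defaultdict(float)
    weight_sums = defaultdict(float)
    for row in rows:
        totals[row[by]] += row[variable] * row[weights]
        weight_sums[row[by]] += row[weights]
    return {group: totals[group] / weight_sums[group] for group in totals}


rows = [  # made-up observations
    {"region": "A", "income": 100.0, "pesoano": 1},
    {"region": "A", "income": 200.0, "pesoano": 3},
    {"region": "B", "income": 50.0, "pesoano": 2},
]
result = weighted_mean_by(rows, "income", "pesoano", "region")
```

In group "A" the observation with weight 3 pulls the mean towards 200, which is the effect weighting is meant to have.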

assign_ptile(variable, n, labels=False, by=None, result_weighted=False, name=None, household_level=False)

Calculate n-tiles for a variable. By default add as new column to data.

Parameters

variable (str) – Reference variable.
n (int) – Number of bins to calculate.
labels (Union[bool, Sequence[str]]) – Passed to pandas.qcut. If False, use int labels for the resulting bins. If True, name bins by their edges. Otherwise pass a sequence of length equal to n, by default False.
by (Union[str, List[str], None]) – Calculate bins for each of the groups, as well as those in splitter, by default None.
result_weighted (bool) – If True, return a pd.DataFrame with the weighted result. Else, add as a column to data, by default False.
name (Optional[str]) – Name for the new column. If None, set as “variable_n”, by default None.
household_level (bool) – If True, calculate at the household level (i.e. consider only data["nper"] == 1), by default False.

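Returns

Weighted result if result_weighted is True; otherwise None, with the bins added to data as a new column.

Return type

Optional[pd.DataFrame]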

Raises

AttributeError – If weights is not defined.
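
One common way to compute weighted n-tiles is to sort by the variable and bin each observation by its cumulative share of total weight. The sketch below shows that technique in plain Python; it is an illustration with made-up data and does not claim to reproduce how pyech combines weights with pandas.qcut.

```python
import math


def weighted_ntile(values, weights, n):
    """Assign each value a bin in 1..n based on cumulative weight share."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = sum(weights)
    bins = [0] * len(values)
    cumulative = 0
    for i in order:
        cumulative += weights[i]
        # Share of total weight at or below this value, scaled to n bins.
        bins[i] = min(n, math.ceil(cumulative * n / total))
    return bins


incomes = [40, 10, 30, 20]  # made-up values, deliberately unsorted
equal = weighted_ntile(incomes, [1, 1, 1, 1], n=2)
skewed = weighted_ntile([10, 20, 30, 40], [3, 1, 1, 1], n=2)
```

In the skewed case the lowest value alone carries half of the total weight, so it fills the first bin by itself.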

convert_real(variables, start=None, end=None)

Convert selected monetary variables to real terms.

Parameters

variables (Union[str, List[str]]) – Column(s) in data. Can be a string or a sequence of strings for multiple columns.
start (Union[str, datetime, date, None]) – Set prices to either of these dates or the mean between them, by default None.
end (Union[str, datetime, date, None]) – Set prices to either of these dates or the mean between them, by default None.

Return type

None
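
The idea behind a conversion to real terms can be sketched as a rescaling by the CPI. This is a plain-Python illustration, assuming the base is the CPI at start or the mean CPI between start and end; the helper name, the "YYYY-MM" keys and the index values are made up and are not pyech's data layout.

```python
from statistics import mean


def to_real(value, month, cpi, start, end=None):
    """Express a nominal value observed in `month` at base-period prices.

    The base is the CPI at `start`, or the mean CPI from `start` to `end`.
    """
    if end is None:
        base = cpi[start]
    else:
        # "YYYY-MM" strings sort chronologically, so string comparison works.
        base = mean(v for m, v in cpi.items() if start <= m <= end)
    return value * base / cpi[month]


cpi = {"2019-01": 100.0, "2019-02": 110.0, "2019-03": 120.0}  # made-up index
at_march_prices = to_real(1000.0, "2019-01", cpi, start="2019-03")
at_q1_mean_prices = to_real(1000.0, "2019-01", cpi, start="2019-01", end="2019-03")
```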

convert_usd(variables)

Convert selected monetary variables to USD.

Parameters: variables (Union[str, List[str]]) – Column(s) in data. Can be a string or a sequence of strings for multiple columns.
Return type: None
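
Conceptually, a conversion to USD divides each peso amount by the exchange rate of the matching month. The sketch below assumes that interpretation; the helper name, the "YYYY-MM" keys and the rates are made up.

```python
def to_usd(value_uyu, month, nxr):
    """Convert a peso amount using that month's UYU-per-USD rate."""
    return value_uyu / nxr[month]


nxr = {"2019-01": 32.5, "2019-02": 32.0}  # made-up UYU per USD
usd_january = to_usd(3250.0, "2019-01", nxr)
usd_february = to_usd(3200.0, "2019-02", nxr)
```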

apply_weights(variables)

Repeat rows as many times as defined in weights.

Parameters: variables (Union[str, List[str]]) – Columns for which weights should be applied. In general it is a good idea to avoid applying weights to all columns since this can result in a large DataFrame.
Return type: pd.DataFrame
Raises: AttributeError – If weights is not defined.
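
Repeating rows by weight can be illustrated in plain Python. The helper name and the data are made up, and the sketch assumes integer weights; it also shows why the result grows quickly, since the output has as many rows as the sum of the weights.

```python
def repeat_by_weight(rows, weights):
    """Repeat each row as many times as its (integer) weight."""
    return [dict(row) for row in rows for _ in range(row[weights])]


rows = [  # made-up observations
    {"e27": 30, "pesoano": 2},
    {"e27": 45, "pesoano": 3},
]
expanded = repeat_by_weight(rows, "pesoano")
ages = [row["e27"] for row in expanded]
```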

save(base_filename, key='df', complevel=9, complib='blosc', **kwargs)

Parameters

base_filename (str) –
key (str) –
complevel (Optional[int]) –
complib (Optional[str]) –

External utility functions

pyech.external.get_cpi()

Download and process CPI data.

Return type: pd.DataFrame

pyech.external.get_nxr()

Download and process USDUYU nominal exchange rate.

Return type: pd.DataFrame