API documentation

ECH class

class pyech.core.ECH(dirpath=None, categorical_threshold=50, splitter=None)[source]

Bases: object

Download, read and process the 2006-2020 Encuesta Continua de Hogares survey carried out by Uruguay’s Instituto Nacional de Estadística.

Parameters
  • dirpath (Union[Path, str, None]) – Path where to download new surveys or read existing ones, by default “.”.

  • categorical_threshold (int) – Number of unique values below which the variable is considered categorical, by default 50.

  • splitter (Optional[str]) – Variable(s) to use for grouping in methods (summarize, assign_ptile), by default None.

data

Survey data, by default pd.DataFrame().

Type

pd.DataFrame

metadata

Survey metadata, by default None.

Type

metadata_container

weights

Column in data used to weight cases. Generally “pesoano” for annual weighting, by default None.

Type

Optional[str]

splitter

Variable(s) to use for grouping in methods (summarize, assign_ptile), by default [].

Type

Union[str, List[str]]

dictionary

Variable dictionary, by default pd.DataFrame().

Type

pd.DataFrame

cpi

Monthly CPI data, by default pd.DataFrame().

Type

pd.DataFrame

nxr

Monthly nominal exchange rate data, by default pd.DataFrame().

Type

pd.DataFrame

classmethod from_data(data, metadata, splitter=None, weights=None)[source]

Build ECH from data and metadata as created by pyreadstat.read_sav().

Parameters
  • data (DataFrame) – Survey data.

  • metadata (metadata_container) – Survey metadata.

  • weights (Optional[str]) – Column in data used to weight cases. Generally “pesoano” for annual weighting, by default None.

  • splitter (Union[str, List[str], None]) – Variable(s) to use for grouping in methods (summarize, assign_ptile), by default [].

Returns

Return type

ECH

property splitter
property year: int
Return type

int

property weights: Optional[str]
Return type

Optional[str]

load(year, from_repo=False, weights=None, splitter=None, missing='\\s+\\.', missing_regex=True, lower=True, multiprocess=False)[source]

Load an ECH survey and dictionary for a specified year.

First attempt to read a survey by looking for “year.sav” in dirpath. If it cannot be found, download the .rar file, extract it to a temporary directory, move the renamed .sav file to dirpath and then read it. Optionally replace missing values with numpy.nan, convert all variable names to lower case and download the corresponding variable dictionary.

For the 2020 survey a new column called “pesoano” is calculated according to the following formula: pesoano = pesomen / 12. The result is rounded and converted to int. This is because 2020 is the first survey that does not have annual weights (“pesoano”). However, they can be calculated from monthly weights (“pesomen”).
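The derivation of the 2020 annual weight can be sketched in plain Python. This only illustrates the formula above: the sample weights are made up, and the exact rounding rule pyech applies is not stated here, so the built-in round() is an assumption.

```python
# Hypothetical monthly weights ("pesomen") for three survey rows.
pesomen = [1200, 1500, 95]

# pesoano = pesomen / 12, rounded and converted to int.
# Assumption: Python's round() stands in for whatever rounding pyech uses.
pesoano = [int(round(w / 12)) for w in pesomen]

print(pesoano)
```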

Parameters
  • year (int) – Survey year

  • from_repo (bool) – If True, download the survey from the GitHub repo as an HDF5+JSON combo, by default False.

  • weights (Optional[str]) – Variable used for weighting cases, by default None.

  • splitter (Union[str, List[str], None]) – Variable(s) to use for grouping in methods (summarize, assign_ptile), by default [].

  • missing (Optional[str]) – Missing values to replace with numpy.nan. Can be a regex with missing_regex=True, by default r"\s+\.".

  • missing_regex (bool) – Whether to parse missing as regex, by default True.

  • lower (bool) – Whether to turn variable names to lower case. This helps with analyzing surveys for several years, by default True.

  • multiprocess (bool) – Whether to use multiprocessing to read the file. It will use all available CPUs, by default False.

Return type

None
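To make the default missing pattern concrete, the sketch below shows which raw strings r"\s+\." would flag. It is a standalone illustration, not pyech's code: the sample values are invented, None stands in for numpy.nan to keep the example dependency-free, and matching the whole string (rather than any substring) is an assumption.

```python
import re

# Default pattern from the signature: one or more whitespace characters
# followed by a literal dot.
MISSING = re.compile(r"\s+\.")


def clean(value):
    """Return None for strings that fully match the missing-value pattern."""
    # Assumption: full-string matching; pyech's actual matching may differ.
    if isinstance(value, str) and MISSING.fullmatch(value):
        return None
    return value


# Hypothetical raw cell values as they might appear in a .sav file.
raw = ["  .", "3.5", " .", "Montevideo", 7]
cleaned = [clean(v) for v in raw]

print(cleaned)
```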

static download(dirpath, year)[source]

Download an ECH survey, unpack the .rar, extract the .sav, rename it as “year.sav” and place it in dirpath.

Parameters
  • dirpath (Union[Path, str]) – Download location.

  • year (int) – Survey year.

Return type

None

get_dictionary(year)[source]

Download and process variable dictionary for a specified year.

Parameters

year (int) – Survey year.

Return type

None

search_dictionary(term, ignore_case=True, regex=True)[source]

Return rows in dictionary with matching terms.

Parameters
  • term (str) – Search term.

  • ignore_case (bool) – Whether to search for upper and lower case. Requires regex=True, by default True.

  • regex (bool) – Whether to parse term as regex, by default True.

Returns

DataFrame containing matching rows.

Return type

pd.DataFrame
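The search behaviour described above can be approximated with the standard library. This is a hedged sketch over a list of dicts rather than the pd.DataFrame that pyech uses; the rows and the search() helper are hypothetical, and case-insensitivity is applied only when regex=True, as the parameter description states.

```python
import re

# Hypothetical dictionary rows; the real dictionary is a pd.DataFrame.
rows = [
    {"variable": "e27", "label": "Edad"},
    {"variable": "e26", "label": "Sexo"},
    {"variable": "ht11", "label": "Ingreso total del hogar"},
]


def search(rows, term, ignore_case=True, regex=True):
    """Return the rows in which any field matches term."""
    if regex:
        pattern = re.compile(term, re.IGNORECASE if ignore_case else 0)

        def hit(text):
            return pattern.search(text) is not None
    else:
        # Literal, case-sensitive substring match.
        def hit(text):
            return term in text

    return [r for r in rows if any(hit(str(v)) for v in r.values())]


matches = search(rows, "ingreso|edad")
print([r["variable"] for r in matches])
```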

summarize(variable, by=None, is_categorical=None, aggfunc='mean', household_level=False, prequery=None, variable_labels=False, value_labels=True, dropna=False)[source]

Summarize a variable in data.

Parameters
  • variable (str) – Variable to summarize.

  • by (Union[str, List[str], None]) – Summarize by these groups, as well as those in splitter, by default None.

  • is_categorical (Optional[bool]) – Whether variable should be treated as categorical. If None, compare its number of unique values with categorical_threshold, by default None.

  • aggfunc (Union[str, Callable]) – Aggregating function. Possible values are “mean”, “sum”, “count”, or any function that works with pd.DataFrame.apply. If variable is categorical, aggfunc=“count” is forced. By default “mean”.

  • prequery (Optional[str]) – Pass a string representing a boolean expression to query the survey before summarizing. For example, ‘e27 >= 18’ would filter out observations where the “e27” variable (age) is lower than 18, and then carry on with summarization. Leverages pandas’ query.

  • household_level (bool) – If True, summarize at the household level (i.e. consider only data[“nper”] == 1), by default False.

  • variable_labels (bool) – Whether to use variable labels from metadata, by default False.

  • value_labels (bool) – Whether to use value labels from metadata, by default True.

  • dropna (bool) – Whether to drop groups with no observations, by default False.

Returns

Summarized variable.

Return type

pd.DataFrame

Raises

AttributeError – If weights is not defined.
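As a rough picture of what a filtered, grouped, weighted summary involves, the sketch below applies the equivalent of prequery="e27 >= 18" and then computes a weighted mean per group. It is not pyech's implementation: the rows, the “region” and “income” columns, and the assumption that “mean” denotes a weight-adjusted mean are all illustrative.

```python
from collections import defaultdict

# Hypothetical rows: age ("e27"), a grouping variable, income and weight.
rows = [
    {"e27": 34, "region": "A", "income": 100.0, "pesoano": 2},
    {"e27": 15, "region": "A", "income": 0.0, "pesoano": 5},
    {"e27": 52, "region": "A", "income": 400.0, "pesoano": 1},
    {"e27": 41, "region": "B", "income": 250.0, "pesoano": 4},
]

# Equivalent of prequery="e27 >= 18": keep adults only.
adults = [r for r in rows if r["e27"] >= 18]

# Weighted mean per group: sum(w * x) / sum(w).
numerator = defaultdict(float)
denominator = defaultdict(float)
for r in adults:
    numerator[r["region"]] += r["pesoano"] * r["income"]
    denominator[r["region"]] += r["pesoano"]

weighted_mean = {g: numerator[g] / denominator[g] for g in numerator}
print(weighted_mean)
```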

assign_ptile(variable, n, labels=False, by=None, result_weighted=False, name=None, household_level=False)[source]

Calculate n-tiles for a variable. By default add as new column to data.

Parameters
  • variable (str) – Reference variable.

  • n (int) – Number of bins to calculate.

  • labels (Union[bool, Sequence[str]]) – Passed to pandas.qcut. If False, use int labels for the resulting bins. If True, name bins by their edges. Otherwise pass a sequence of length equal to n, by default False.

  • by (Union[str, List[str], None]) – Calculate bins for each of the groups, as well as those in splitter, by default None.

  • result_weighted (bool) – If True, return a pd.DataFrame with the weighted result. Else, add as a column to data, by default False.

  • name (Optional[str]) – Name for the new column. If None, set as “variable_n”, by default None.

  • household_level (bool) – If True, calculate at the household level (i.e. consider only data[“nper”] == 1), by default False.

Returns

Return type

Optional[pd.DataFrame]

Raises

AttributeError – If weights is not defined.
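One way to think about weighted n-tiles is to sort observations by value and cut the cumulative weight into n equal shares. The sketch below follows that idea under stated assumptions: integer weights, 1-based bin numbers, and a hypothetical weighted_ntile() helper. How pyech combines weights with pandas.qcut internally is not described here, so treat this as an illustration of the concept only.

```python
def weighted_ntile(values, weights, n):
    """Assign each value a bin in 1..n so bins hold roughly equal weight."""
    total = sum(weights)
    # Indices of the observations sorted by their value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    cumulative = 0
    for i in order:
        cumulative += weights[i]
        # Integer ceiling of cumulative * n / total (weights assumed to be ints).
        bins[i] = (cumulative * n + total - 1) // total
    return bins


# Hypothetical incomes and case weights.
values = [10, 50, 20, 40, 30]
weights = [1, 1, 2, 1, 1]

halves = weighted_ntile(values, weights, 2)
print(halves)
```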

convert_real(variables, start=None, end=None)[source]

Convert selected monetary variables to real terms.

Parameters
  • variables (Union[str, List[str]]) – Column(s) in data. Can be a string or a sequence of strings for multiple columns.

  • start (Union[str, datetime, date, None]) – Set prices to either of these dates or the mean between them, by default None.

  • end (Union[str, datetime, date, None]) – Set prices to either of these dates or the mean between them, by default None.

Return type

None
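Conversion to real terms amounts to rescaling each nominal amount by the ratio between a reference CPI and the CPI of the month in which it was observed. The following is a minimal sketch of that arithmetic, assuming made-up index values, “YYYY-MM” string keys and a hypothetical to_real() helper; pyech itself works on the cpi DataFrame and the columns of data.

```python
# Hypothetical monthly CPI index values keyed by "YYYY-MM".
cpi = {"2019-01": 100.0, "2019-06": 104.0, "2019-12": 108.0}


def to_real(nominal, month, start, end=None):
    """Express a nominal amount observed in `month` at reference prices."""
    if end is None:
        # Prices of a single reference month.
        reference = cpi[start]
    else:
        # Mean CPI over the available months from start to end (inclusive).
        window = [v for k, v in cpi.items() if start <= k <= end]
        reference = sum(window) / len(window)
    return nominal * reference / cpi[month]


# 1000 pesos from January 2019 expressed at December 2019 prices.
at_december = to_real(1000.0, "2019-01", start="2019-12")
# The same amount expressed at average 2019 prices.
at_year_mean = to_real(1000.0, "2019-01", start="2019-01", end="2019-12")
print(at_december, at_year_mean)
```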

convert_usd(variables)[source]

Convert selected monetary variables to USD.

Parameters

variables (Union[str, List[str]]) – Column(s) in data. Can be a string or a sequence of strings for multiple columns.

Return type

None
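The USD conversion can be pictured as dividing each amount in pesos by the exchange rate of the matching month. The sketch below assumes invented rates, “YYYY-MM” keys and a hypothetical to_usd() helper; it is not how pyech joins nxr to data.

```python
# Hypothetical monthly exchange rates (UYU per USD) keyed by "YYYY-MM".
nxr = {"2019-01": 32.0, "2019-02": 32.5}


def to_usd(amount_uyu, month):
    """Convert an amount in pesos to USD using that month's exchange rate."""
    return amount_uyu / nxr[month]


usd = to_usd(6400.0, "2019-01")
print(usd)
```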

apply_weights(variables)[source]

Repeat rows as many times as defined in weights.

Parameters

variables (Union[str, List[str]]) – Columns for which weights should be applied. In general it is a good idea to avoid applying weights to all columns since this can result in a large DataFrame.

Returns

Return type

pd.DataFrame

Raises

AttributeError – If weights is not defined.
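Repeating rows by weight can be illustrated without pandas. In this sketch the rows and the “pesoano” weights are made up, and only the selected variable (“e27”) is carried into the expanded result, mirroring the advice to apply weights to a few columns at a time.

```python
# Hypothetical rows with an age variable and an integer case weight.
rows = [
    {"e27": 34, "pesoano": 2},
    {"e27": 15, "pesoano": 3},
]

# Repeat each row as many times as its weight, keeping only "e27".
expanded = [
    {"e27": r["e27"]}
    for r in rows
    for _ in range(r["pesoano"])
]

print(len(expanded))
```

The expanded list has one entry per weighted case, which is why the result grows quickly when weights are large.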

save(base_filename, key='df', complevel=9, complib='blosc', **kwargs)[source]
Parameters
  • base_filename (str) –

  • key (str) –

  • complevel (Optional[int]) –

  • complib (Optional[str]) –

External utility functions

pyech.external.get_cpi()[source]

Download and process CPI data.

Returns

Return type

pd.DataFrame

pyech.external.get_nxr()[source]

Download and process USDUYU nominal exchange rate.

Returns

Return type

pd.DataFrame