
PyECH¶
Process INE’s ECH surveys in Python.
User documentation¶

Overview¶
A simple package that streamlines the download-read-wrangling process needed to analyze the Encuesta Continua de Hogares survey carried out by the Instituto Nacional de Estadística (Uruguay).
Here’s what PyECH can do:
- Download compressed survey files.
- Unrar, rename and move the SAV (SPSS) file to a specified path.
- Read surveys from SAV files, keeping variable and value labels.
- Download and process variable dictionaries.
- Search through variable dictionaries.
- Summarize variables.
- Calculate variable n-tiles.
- Convert variables to real terms or USD.
PyECH does not attempt to estimate any particular indicators, facilitate any kind of modelling, or concatenate surveys from multiple years. Instead, it aims to provide a hassle-free experience with as simple a syntax as possible.
Surprisingly, PyECH covers a lot of what people tend to do with the ECH survey without having to deal with software licensing.
For R users, check out ech.
Installation¶
pip install pyech
Dependencies¶
In order to unpack downloaded survey files you will need unrar on your system. This should be covered if you have WinRAR or 7zip installed. Otherwise, run sudo apt-get install unrar or whatever is appropriate for your system.
Usage¶
Loading a survey is as simple as using ECH.load, which will download it if it cannot be found at dirpath (by default the current working directory).
from pyech import ECH
survey = ECH()
survey.load(year=2019, weights="pesoano")
Optionally, load accepts from_repo=True, which downloads survey data from the PyECH Github repository (HDF5+JSON). Loading data this way is significantly faster.
ECH.load also downloads the corresponding variable dictionary, which can be easily searched.
survey.search_dictionary("ingreso", ignore_case=True, regex=True)
This will return a pandas DataFrame where every row matches the search term in any of its columns.
Calculating aggregations is as simple as using ECH.summarize.
survey.summarize("ht11", by="dpto", aggfunc="mean", household_level=True)
This returns a pandas DataFrame with the mean of “ht11” grouped by ECH.splitter and by (both are optional). Cases are weighted by the column defined in ECH.load.
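Under the hood, a weighted group mean is just sum(value × weight) / sum(weight) within each group. The following is a minimal pandas sketch of that mechanic with made-up data (it does not use PyECH itself; the column names mirror the example above but the values are illustrative):

```python
import pandas as pd

# Toy survey: household income ("ht11"), department ("dpto") and weights.
toy = pd.DataFrame(
    {
        "ht11": [100.0, 200.0, 300.0, 400.0],
        "dpto": ["Montevideo", "Montevideo", "Canelones", "Canelones"],
        "pesoano": [1, 3, 2, 2],
    }
)

# Weighted mean per group: sum(value * weight) / sum(weight).
grouped = toy.assign(weighted=toy["ht11"] * toy["pesoano"]).groupby("dpto")
weighted_mean = grouped["weighted"].sum() / grouped["pesoano"].sum()
```

PyECH wraps this kind of computation (plus labels, household filtering and more) behind the single summarize call shown above.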
API documentation¶
The API documentation below, automatically generated from source code, covers the specifics.
ECH class¶
- class pyech.core.ECH(dirpath=None, categorical_threshold=50, splitter=None)[source]¶
Bases: object
Download, read and process the 2006-2020 Encuesta Continua de Hogares survey carried out by Uruguay’s Instituto Nacional de Estadística.
- Parameters
dirpath (Union[Path, str, None]) – Path where to download new surveys or read existing ones, by default “.”.
categorical_threshold (int) – Number of unique values below which a variable is considered categorical, by default 50.
splitter (Optional[str]) – Variable(s) to use for grouping in methods (summarize, assign_ptile), by default None.
- data¶
Survey data, by default pd.DataFrame().
- Type
pd.DataFrame
- metadata¶
Survey metadata, by default None.
- Type
metadata_container
- weights¶
Column in data used to weight cases. Generally “pesoano” for annual weighting, by default None.
- Type
Optional[str]
- splitter¶
Variable(s) to use for grouping in methods (summarize, assign_ptile), by default [].
- Type
Union[str, List[str]]
- dictionary¶
Variable dictionary, by default pd.DataFrame().
- Type
pd.DataFrame
- cpi¶
Monthly CPI data, by default pd.DataFrame().
- Type
pd.DataFrame
- nxr¶
Monthly nominal exchange rate data, by default pd.DataFrame().
- Type
pd.DataFrame
- classmethod from_data(data, metadata, splitter=None, weights=None)[source]¶
Build ECH from data and metadata as created by pyreadstat.read_sav().
- Parameters
data (DataFrame) – Survey data.
metadata (metadata_container) – Survey metadata.
weights (Optional[str]) – Column in data used to weight cases. Generally “pesoano” for annual weighting, by default None.
splitter (Union[str, List[str], None]) – Variable(s) to use for grouping in methods (summarize, assign_ptile), by default [].
- Returns
ECH instance built from the provided data and metadata.
- Return type
ECH
- property splitter¶
- property year: int¶
- Return type
int
- property weights: Optional[str]¶
- Return type
Optional[str]
- load(year, from_repo=False, weights=None, splitter=None, missing='\\s+\\.', missing_regex=True, lower=True, multiprocess=False)[source]¶
Load an ECH survey and its dictionary for a specified year.
First attempt to read a survey by looking for “year.sav” in dirpath. If it cannot be found, download the .rar file, extract it to a temporary directory, move the renamed .sav file to dirpath and then read it. Optionally replace missing values with numpy.nan, lower all variable names and download the corresponding variable dictionary.
For the 2020 survey a new column called “pesoano” is calculated as pesoano = pesomen / 12, rounded and converted to int. This is because 2020 is the first survey that does not include annual weights (“pesoano”); they can, however, be calculated from monthly weights (“pesomen”).
- Parameters
year (int) – Survey year.
from_repo (bool) – If True, download the survey from the Github repo as an HDF5+JSON combo.
weights (Optional[str]) – Variable used for weighting cases, by default None.
splitter (Union[str, List[str], None]) – Variable(s) to use for grouping in methods (summarize, assign_ptile), by default [].
missing (Optional[str]) – Missing values to replace with numpy.nan. Can be a regex with missing_regex=True, by default r"\s+\.".
missing_regex (bool) – Whether to parse missing as regex, by default True.
lower (bool) – Whether to turn variable names to lower case. This helps with analyzing surveys for several years, by default True.
multiprocess (bool) – Whether to use multiprocessing to read the file. It will use all available CPUs, by default False.
- Return type
None
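The 2020 annual-weight computation described above amounts to the following pandas one-liner (the monthly weights here are toy values, not real survey data):

```python
import pandas as pd

# Toy monthly weights ("pesomen") for three 2020 observations.
pesomen = pd.Series([120, 65, 7])

# Annual weights: pesoano = pesomen / 12, rounded and converted to int.
pesoano = (pesomen / 12).round().astype(int)
```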
- static download(dirpath, year)[source]¶
Download an ECH survey, unpack the .rar, extract the .sav, rename it to “year.sav” and place it in dirpath.
- Parameters
dirpath (Union[Path, str]) – Download location.
year (int) – Survey year.
- Return type
None
- get_dictionary(year)[source]¶
Download and process variable dictionary for a specified year.
- Parameters
year (int) – Survey year.
- Return type
None
- search_dictionary(term, ignore_case=True, regex=True)[source]¶
Return rows in dictionary with matching terms.
- Parameters
term (str) – Search term.
ignore_case (bool) – Whether to ignore case when searching. Requires regex=True, by default True.
regex (bool) – Whether to parse term as regex, by default True.
- Returns
DataFrame containing matching rows.
- Return type
pd.DataFrame
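The matching behaviour can be sketched with plain pandas: keep any row where any column matches the term as a case-insensitive regex. The toy dictionary below is illustrative only; real ECH dictionaries have different columns.

```python
import pandas as pd

# Toy variable dictionary with names and descriptions.
dictionary = pd.DataFrame(
    {
        "Nombre": ["ht11", "e27", "pobre"],
        "Descripcion": [
            "Ingreso total del hogar",
            "Edad",
            "Situacion de pobreza",
        ],
    }
)

term = "ingreso"

# Rows where any column matches the term (case-insensitive regex).
mask = dictionary.apply(
    lambda col: col.astype(str).str.contains(term, case=False, regex=True)
).any(axis=1)
matches = dictionary[mask]
```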
- summarize(variable, by=None, is_categorical=None, aggfunc='mean', household_level=False, prequery=None, variable_labels=False, value_labels=True, dropna=False)[source]¶
Summarize a variable in data.
- Parameters
variable (str) – Variable to summarize.
by (Union[str, List[str], None]) – Summarize by these groups, as well as those in splitter, by default None.
is_categorical (Optional[bool]) – Whether the variable should be treated as categorical. If None, compare its number of unique values with categorical_threshold, by default None.
aggfunc (Union[str, Callable]) – Aggregating function. Possible values are “mean”, “sum”, “count”, or any function that works with pd.DataFrame.apply. If the variable is categorical, aggfunc=”count” is forced, by default “mean”.
prequery (Optional[str]) – A string representing a boolean expression used to query the survey before summarizing. For example, ‘e27 >= 18’ filters out observations where the “e27” variable (age) is lower than 18, and then carries on with summarization. Leverages pandas’ query.
household_level (bool) – If True, summarize at the household level (i.e. consider only data["nper"] == 1), by default False.
variable_labels (bool) – Whether to use variable labels from metadata, by default False.
value_labels (bool) – Whether to use value labels from metadata, by default True.
dropna (bool) – Whether to drop groups with no observations, by default False.
- Returns
Summarized variable.
- Return type
pd.DataFrame
- Raises
AttributeError – If weights is not defined.
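prequery simply runs pandas.DataFrame.query on the survey before aggregating. A toy sketch of the mechanic (a weighted count of adults, using made-up values rather than PyECH itself):

```python
import pandas as pd

# Toy person-level data: age ("e27") and weights.
toy = pd.DataFrame({"e27": [15, 30, 45, 70], "pesoano": [2, 3, 1, 4]})

# prequery-style filtering: evaluate a boolean expression first.
adults = toy.query("e27 >= 18")

# A weighted count is the sum of weights over the remaining rows.
weighted_count = adults["pesoano"].sum()
```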
- assign_ptile(variable, n, labels=False, by=None, result_weighted=False, name=None, household_level=False)[source]¶
Calculate n-tiles for a variable. By default, add them as a new column to data.
- Parameters
variable (str) – Reference variable.
n (int) – Number of bins to calculate.
labels (Union[bool, Sequence[str]]) – Passed to pandas.qcut. If False, use int labels for the resulting bins. If True, name bins by their edges. Otherwise pass a sequence of length equal to n, by default False.
by (Union[str, List[str], None]) – Calculate bins for each of these groups, as well as those in splitter, by default None.
result_weighted (bool) – If True, return a pd.DataFrame with the weighted result. Else, add the result as a column to data, by default False.
name (Optional[str]) – Name for the new column. If None, set as “variable_n”, by default None.
household_level (bool) – If True, calculate at the household level (i.e. consider only data["nper"] == 1), by default False.
- Returns
The weighted result if result_weighted=True, else None.
- Return type
Optional[pd.DataFrame]
- Raises
AttributeError – If weights is not defined.
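The unweighted core of the binning is pandas.qcut; assign_ptile layers weighting and grouping on top of it. A toy quartile example of that core (not the PyECH implementation):

```python
import pandas as pd

# Toy income values.
income = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# Quartiles with integer labels (labels=False gives 0-based bin codes).
quartile = pd.qcut(income, 4, labels=False) + 1  # shift to 1-based bins
```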
- convert_real(variables, start=None, end=None)[source]¶
Convert selected monetary variables to real terms.
- Parameters
variables (Union[str, List[str]]) – Column(s) in data. Can be a string or a sequence of strings for multiple columns.
start (Union[str, datetime, date, None]) – Set prices to either of these dates or the mean between them, by default None.
end (Union[str, datetime, date, None]) – Set prices to either of these dates or the mean between them, by default None.
- Return type
None
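Deflating to real terms multiplies each nominal value by the ratio of base-period CPI to observation-period CPI. A toy sketch of the arithmetic (the CPI values here are made up, not the actual INE series PyECH downloads):

```python
import pandas as pd

# Toy nominal incomes observed in two months, with a toy CPI index.
df = pd.DataFrame({"month": ["2019-01", "2019-02"], "ht11": [100.0, 103.0]})
cpi = pd.Series({"2019-01": 100.0, "2019-02": 103.0})

base = "2019-01"  # express everything at January prices

# Deflate: nominal value * (CPI at base date / CPI at observation date).
df["ht11_real"] = df["ht11"] * cpi[base] / df["month"].map(cpi)
```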
- convert_usd(variables)[source]¶
Convert selected monetary variables to USD.
- Parameters
variables (Union[str, List[str]]) – Column(s) in data. Can be a string or a sequence of strings for multiple columns.
- Return type
None
- apply_weights(variables)[source]¶
Repeat rows as many times as defined in weights.
- Parameters
variables (Union[str, List[str]]) – Columns to which weights should be applied. In general it is a good idea to avoid applying weights to all columns, since this can result in a very large DataFrame.
- Returns
Survey data with rows repeated according to weights.
- Return type
pd.DataFrame
- Raises
AttributeError – If weights is not defined.
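Row expansion of this kind boils down to pandas' Index.repeat. A toy sketch with made-up weights:

```python
import pandas as pd

# Toy data with integer expansion weights.
toy = pd.DataFrame({"e27": [30, 45], "pesoano": [2, 3]})

# Repeat each row as many times as its weight.
expanded = toy.loc[toy.index.repeat(toy["pesoano"])].reset_index(drop=True)
```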