DropMissingData
API Reference
- class feature_engine.imputation.DropMissingData(missing_only=True, variables=None)
DropMissingData() deletes rows that contain missing values. It provides functionality similar to pandas.DataFrame.dropna().
It works for both numerical and categorical variables. You can pass the list of variables that should be checked for missing values. Alternatively, the transformer will automatically select all variables in the dataframe.
Note: The transformer first selects all variables, or the variables entered by the user. If missing_only=True, it then re-selects from that group only the variables that show missing data during fit, that is, in the train set.
- Parameters
- missing_only: bool, default=True
If True, rows with missing data will be dropped only for the variables that have missing data in the train set, during fit. If False, rows with NA in any of the variables indicated by the user will be dropped.
- variables: list, default=None
The list of variables to check for missing values. If None, the transformer will find and select all variables in the dataframe.
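The selection described by these two parameters can be sketched in plain pandas. This is an illustration of the documented behaviour under stated assumptions, not the library's implementation; the toy dataframe and the names candidates, variables_missing_only and variables_all are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy training data (hypothetical): only "age" and "city" contain NA.
X_train = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Lima", "Oslo", None, "Rome"],
    "income": [10.0, 20.0, 30.0, 40.0],
})

# variables=None: start from every column in the dataframe.
candidates = list(X_train.columns)

# missing_only=True: keep only the candidates that show NA in the train set.
variables_missing_only = [c for c in candidates if X_train[c].isna().any()]

# missing_only=False: keep every candidate, with or without NA in train.
variables_all = candidates

print(variables_missing_only)  # ['age', 'city']
print(variables_all)           # ['age', 'city', 'income']
```

With missing_only=True, "income" is left out of the selection because it has no NA in the train set, so NA appearing in "income" later would not cause rows to be dropped.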
Attributes
variables_:
List of variables for which the rows with NA will be deleted.
n_features_in_:
The number of features in the train set used in fit.
Methods
fit:
Learn the variables for which the rows with NA will be deleted
transform:
Remove observations with NA
fit_transform:
Fit to the data, then transform it.
return_na_data:
Returns the dataframe with the rows that contain NA.
- fit(X, y=None)
Learn the variables for which the rows with NA will be deleted.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The training dataset.
- y: pandas Series, default=None
y is not needed by this transformer. You can pass y or None.
- Returns
- self
- Raises
- TypeError
If the input is not a Pandas DataFrame
- return_na_data(X)
Returns the subset of the dataframe which contains the rows with missing values. This method could be useful in production, in case we want to store the observations that will not be fed into the model.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The dataframe to be transformed.
- Returns
- X: pandas dataframe of shape = [obs_with_na, features]
The dataframe containing only the rows with missing values.
- rtype
DataFrame
- Raises
- TypeError
If the input is not a Pandas DataFrame
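As a rough pandas equivalent of what return_na_data is described to return, the rows with NA in the selected variables can be separated from the complete cases with a boolean mask. This is a sketch only: the dataframe is made up, and the list named variables is a hypothetical stand-in for the fitted variables_ attribute.

```python
import numpy as np
import pandas as pd

# Hypothetical data arriving in production.
X = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [10.0, 20.0, np.nan],
})

# Hypothetical stand-in for the variables learned during fit.
variables = ["age"]

# True for rows with NA in at least one of the selected variables.
has_na = X[variables].isna().any(axis=1)

rows_with_na = X[has_na]     # rows that would be set aside and stored
complete_cases = X[~has_na]  # rows that would go on to the model

print(rows_with_na.index.tolist())    # [1]
print(complete_cases.index.tolist())  # [0, 2]
```

Row 2 has NA in "income" but is not flagged, because only "age" is among the selected variables in this sketch.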
- transform(X)
Remove rows with missing values.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The dataframe to be transformed.
- Returns
- X_transformed: pandas dataframe
The complete case dataframe for the selected variables, of shape [n_samples - rows_with_na, n_features]
- rtype
DataFrame
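The interplay between fit and transform can be illustrated with pandas alone. The sketch below assumes the documented behaviour with missing_only=True (variables are chosen on the train set and then applied unchanged to new data); the toy dataframes and the variable names are hypothetical, and DataFrame.dropna stands in for the transformer.

```python
import numpy as np
import pandas as pd

X_train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
})
X_test = pd.DataFrame({
    "a": [np.nan, 2.0, 3.0],
    "b": [1.0, np.nan, 3.0],
})

# "fit": select the variables that show NA in the train set.
variables = [c for c in X_train.columns if X_train[c].isna().any()]

# "transform": drop rows with NA in those variables only.
train_t = X_train.dropna(subset=variables)
test_t = X_test.dropna(subset=variables)

print(variables)                 # ['a']
print(train_t.shape)             # (3, 2)
print(test_t.shape)              # (2, 2)
print(test_t["b"].isna().sum())  # 1
```

Because "b" had no NA in the train set, the NA that appears in "b" in the test set survives the transformation. If that is not acceptable, missing_only=False with an explicit variables list is the safer configuration.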
Example
DropMissingData() deletes rows with missing values. It works with numerical and categorical variables. You can pass a list of variables to check for missing values, or the transformer will select and check all variables. The transformer also has the option to learn the variables with missing data in the train set, and then remove observations with NA only in those variables.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.imputation import DropMissingData
# Load dataset
data = pd.read_csv('houseprice.csv')
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)
# set up the imputer
missingdata_imputer = DropMissingData(variables=['LotFrontage', 'MasVnrArea'])
# fit the imputer
missingdata_imputer.fit(X_train)
# transform the data
train_t = missingdata_imputer.transform(X_train)
test_t = missingdata_imputer.transform(X_test)
# Number of NA before the transformation
X_train['LotFrontage'].isna().sum()
189
# Number of NA after the transformation:
train_t['LotFrontage'].isna().sum()
0
# Number of rows before and after transformation
print(X_train.shape)
print(train_t.shape)
(1022, 79)
(829, 79)