Featurizers

class text_classification.featurizer.base.BaseFeaturizer

Bases: abc.ABC

Base class that all featurizer classes should inherit from to ensure uniformity.

COARSE_POS_TAGS = ['ADJ', 'ADP', 'ADV', 'AUX', 'CONJ', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X', 'SPACE']

abstract add_feature(feature_extraction_function)

Should add a custom feature extraction function to the ones defined in the inheriting class.

Parameters: feature_extraction_function (function) – Custom function that extracts features from text.

abstract extract_features(preprocessor, exclude={})

Should add two fields to each of the preprocessor’s instances: feature_vector, a list of numerical feature values, and feature_names, a list of strings that briefly describe each feature.

Parameters

preprocessor (BasePreprocessor) – Preprocessor containing samples to featurize.
exclude (Set[str]) – Set of features that should be excluded from the resulting feature vectors.
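
The contract described above can be sketched with a toy subclass. This is only an illustration: SketchBaseFeaturizer and LengthFeaturizer are hypothetical stand-ins that are not part of the package, and the preprocessor is assumed to be a plain list of dicts with a 'text' key, which the real BasePreprocessor may not be.

```python
from abc import ABC, abstractmethod


class SketchBaseFeaturizer(ABC):
    """Hypothetical stand-in that mirrors the two abstract methods documented above."""

    @abstractmethod
    def add_feature(self, feature_extraction_function):
        ...

    @abstractmethod
    def extract_features(self, preprocessor, exclude=frozenset()):
        ...


class LengthFeaturizer(SketchBaseFeaturizer):
    """Toy featurizer for illustration; not part of text_classification."""

    def __init__(self):
        self._functions = [self._length_features]

    @staticmethod
    def _length_features(instance):
        text = instance["text"]
        return {
            "feature_names": ["num_chars", "num_tokens"],
            "feature_vector": [len(text), len(text.split())],
        }

    def add_feature(self, feature_extraction_function):
        self._functions.append(feature_extraction_function)

    def extract_features(self, preprocessor, exclude=frozenset()):
        # Assumption: the preprocessor is a plain list of instance dicts.
        for instance in preprocessor:
            names, values = [], []
            for function in self._functions:
                result = function(instance)
                pairs = zip(result["feature_names"], result["feature_vector"])
                for name, value in pairs:
                    if name not in exclude:  # drop excluded features by name
                        names.append(name)
                        values.append(value)
            # Add the two documented fields to the instance in place.
            instance["feature_names"] = names
            instance["feature_vector"] = values


samples = [{"text": "hello world"}]
LengthFeaturizer().extract_features(samples, exclude={"num_chars"})
```

After the call, each sample carries feature_names and feature_vector, with the excluded feature removed from both lists.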

classmethod load(filename)

Loads a previously saved featurizer from a binary file.

Parameters: filename (str) – Name of the binary file that the featurizer should be loaded from.
Returns: Featurizer instance.

save(filename)

Saves current featurizer instance in binary format.

Parameters: filename (str) – Name of the file where the featurizer should be saved.
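
A save/load round trip might look like the sketch below. The documentation only says "binary format", so the use of pickle is an assumption, and PicklableFeaturizer is a hypothetical stand-in rather than one of the package's classes.

```python
import os
import pickle
import tempfile


class PicklableFeaturizer:
    """Hypothetical stand-in; the real classes may serialize differently."""

    def __init__(self, normalize=True):
        self.normalize = normalize

    def save(self, filename):
        # Assumption: the instance state is written with pickle.
        with open(filename, "wb") as handle:
            pickle.dump(self.__dict__, handle)

    @classmethod
    def load(cls, filename):
        with open(filename, "rb") as handle:
            state = pickle.load(handle)
        instance = cls.__new__(cls)  # rebuild without calling __init__
        instance.__dict__.update(state)
        return instance


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "featurizer.bin")
    PicklableFeaturizer(normalize=False).save(path)
    restored = PicklableFeaturizer.load(path)
```

Because load is a classmethod, it is called on the class itself and returns a new instance with the saved settings.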

class text_classification.featurizer.tweet_featurizer.TweetFeaturizer(lang_model='en_core_web_sm', normalize=True)

Bases: text_classification.featurizer.base.BaseFeaturizer

Featurizer that extracts features from tweets. It contains no paragraph-based features, since these do not apply to tweets.

__init__(lang_model='en_core_web_sm', normalize=True)

Instantiates a TweetFeaturizer instance.

Parameters

lang_model (str) – A spaCy language model name.
normalize (bool) – Whether to normalize the features by the number of characters or tokens.
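
The effect of normalize can be illustrated as follows. The documentation does not say which features are normalized or how, so this sketch assumes that token-level counts are divided by the number of tokens and character-level counts by the number of characters. The function name and the whitespace tokenization are hypothetical; the class itself relies on a spaCy model.

```python
def hashtag_and_uppercase_features(instance, normalize=True):
    """Hypothetical feature function showing raw versus normalized counts."""
    text = instance["text"]
    tokens = text.split()  # assumption: whitespace tokens instead of spaCy tokens
    num_hashtags = sum(1 for token in tokens if token.startswith("#"))
    num_uppercase = sum(1 for char in text if char.isupper())

    if normalize:
        # Token-level count relative to tokens, char-level count relative to chars.
        hashtag_value = num_hashtags / len(tokens) if tokens else 0.0
        uppercase_value = num_uppercase / len(text) if text else 0.0
    else:
        hashtag_value, uppercase_value = num_hashtags, num_uppercase

    return {
        "feature_names": ["hashtags", "uppercase_chars"],
        "feature_vector": [hashtag_value, uppercase_value],
    }


tweet = {"text": "Go #team go #win"}
normalized = hashtag_and_uppercase_features(tweet)
raw = hashtag_and_uppercase_features(tweet, normalize=False)
```

Normalized values make tweets of different lengths comparable, whereas raw counts grow with the length of the text.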

add_feature(feature_extraction_function)

Adds a custom feature extraction function to the predefined ones. The function must take a dictionary containing the key ‘text’ as input and return a dict with ‘feature_names’ and ‘feature_vector’ as keys.

Parameters: feature_extraction_function (function) – Custom function that extracts features from text.
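
A custom function that satisfies this input and output contract could look like the following minimal sketch. The function name and the two features are made up for illustration; only the 'text', 'feature_names' and 'feature_vector' keys come from the description above.

```python
def exclamation_features(instance):
    """Hypothetical custom feature function for use with add_feature."""
    text = instance["text"]
    return {
        "feature_names": ["num_exclamation_marks", "ends_with_exclamation"],
        "feature_vector": [
            text.count("!"),
            int(text.rstrip().endswith("!")),
        ],
    }


result = exclamation_features({"text": "Great game tonight!!"})

# Registering the function requires the package and the spaCy model:
# featurizer = TweetFeaturizer()
# featurizer.add_feature(exclamation_features)
```

Keeping feature_names and feature_vector the same length and in the same order lets each value be matched to its description.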

extract_features(preprocessor, exclude={})

Extracts the features for all splits in the preprocessor and adds the feature vector and feature names to each instance in place.

Parameters

preprocessor (BasePreprocessor) – Preprocessor containing samples to featurize.
exclude (Set[str]) – Set of features that should be excluded from the resulting feature vectors.