Featurizers

class text_classification.featurizer.base.BaseFeaturizer[source]

Bases: abc.ABC

Base class that all featurizer classes should inherit from to ensure uniformity.

COARSE_POS_TAGS = ['ADJ', 'ADP', 'ADV', 'AUX', 'CONJ', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X', 'SPACE']
abstract add_feature(feature_extraction_function)[source]

Should add a custom feature extraction function to the ones defined in the inheriting class.

Parameters

feature_extraction_function (function) – Custom function that extracts features from text.

abstract extract_features(preprocessor, exclude={})[source]

Should add two fields to each of the preprocessor’s instances: feature_vector, a list of numerical feature values, and feature_names, a list of strings that briefly describe each feature.

Parameters
  • preprocessor (BasePreprocessor) – Preprocessor containing samples to featurize.

  • exclude (Set[str]) – Set of features that should be excluded from the resulting feature vectors.

classmethod load(filename)[source]

Loads a previously saved featurizer from a binary file.

Parameters

filename (str) – Name of the binary file that the featurizer should be loaded from.

Returns

Featurizer instance.

save(filename)[source]

Saves current featurizer instance in binary format.

Parameters

filename (str) – Name of the file where the featurizer should be saved.
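
The save/load pair can be pictured as a simple round trip. The sketch below is not the library's implementation: the documentation only says "binary format", so the use of pickle is an assumption, and SketchFeaturizer is a hypothetical stand-in class used purely for illustration.

```python
import os
import pickle
import tempfile


class SketchFeaturizer:
    """Hypothetical stand-in for a featurizer; not the real BaseFeaturizer."""

    def __init__(self, normalize=True):
        self.normalize = normalize

    def save(self, filename):
        # Assumption: the instance state is serialized with pickle.
        with open(filename, "wb") as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def load(cls, filename):
        # Rebuild an instance from the stored state without calling __init__.
        with open(filename, "rb") as f:
            state = pickle.load(f)
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        return obj


# Round trip: save to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "featurizer.bin")
    SketchFeaturizer(normalize=False).save(path)
    restored = SketchFeaturizer.load(path)
```

With the real classes, the equivalent calls would be featurizer.save(filename) followed by TweetFeaturizer.load(filename), as documented above.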

class text_classification.featurizer.tweet_featurizer.TweetFeaturizer(lang_model='en_core_web_sm', normalize=True)[source]

Bases: text_classification.featurizer.base.BaseFeaturizer

Featurizer that extracts features from tweets. It contains no paragraph-based features, as these don’t apply to tweets.

__init__(lang_model='en_core_web_sm', normalize=True)[source]

Creates a TweetFeaturizer instance.

Parameters
  • lang_model (str) – A spaCy language model name.

  • normalize (bool) – Whether to normalize the features based on number of chars/tokens.

add_feature(feature_extraction_function)[source]

Adds a custom feature extraction function to the predefined ones. The function must take as input a dictionary containing the key ‘text’ and return a dict with ‘feature_names’ and ‘feature_vector’ as keys.

Parameters

feature_extraction_function (function) – Custom function that extracts features from text.
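
A function matching the contract described above might look like the following sketch. The function name and the two features it computes are hypothetical; only the input key ‘text’ and the output keys ‘feature_names’ and ‘feature_vector’ come from the documentation.

```python
def punctuation_features(instance):
    """Hypothetical custom extractor: counts '!' and '?' in the text."""
    text = instance["text"]
    return {
        # One short description per feature, in the same order as the values.
        "feature_names": ["num_exclamation_marks", "num_question_marks"],
        "feature_vector": [text.count("!"), text.count("?")],
    }


# Calling the function directly on a dictionary with a 'text' key.
result = punctuation_features({"text": "Really?! Wow!"})
# result["feature_vector"] is [2, 1]
```

Such a function would then be registered with featurizer.add_feature(punctuation_features) before calling extract_features.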

extract_features(preprocessor, exclude={})[source]

Extracts the features for all splits in the preprocessor and adds a feature vector and feature names to each instance in place.

Parameters
  • preprocessor (BasePreprocessor) – Preprocessor containing samples to featurize.

  • exclude (Set[str]) – Set of features that should be excluded from the resulting feature vectors.
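
To illustrate what excluding features means for the resulting vectors, here is a minimal sketch. It assumes that entries in exclude are matched against the strings in feature_names; the documentation does not state how matching is done, and drop_excluded is a hypothetical helper, not part of the library.

```python
def drop_excluded(feature_names, feature_vector, exclude=frozenset()):
    """Hypothetical helper: remove features whose name appears in `exclude`."""
    kept = [
        (name, value)
        for name, value in zip(feature_names, feature_vector)
        if name not in exclude
    ]
    # Split the surviving pairs back into two parallel lists.
    names = [name for name, _ in kept]
    values = [value for _, value in kept]
    return names, values


names, values = drop_excluded(
    ["num_tokens", "num_hashtags", "num_mentions"],
    [12, 3, 1],
    exclude={"num_hashtags"},
)
# names is ["num_tokens", "num_mentions"], values is [12, 1]
```

The sketch uses an immutable frozenset() as the default rather than a mutable literal, which avoids sharing state between calls; note also that in Python {} is an empty dict, while an empty set is written set().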