Featurizers
-
class text_classification.featurizer.base.BaseFeaturizer
Bases: abc.ABC
Base class that all featurizer classes should inherit from to ensure uniformity.
-
COARSE_POS_TAGS = ['ADJ', 'ADP', 'ADV', 'AUX', 'CONJ', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X', 'SPACE']
-
abstract add_feature(feature_extraction_function)
Should add a custom feature extraction function to the ones defined in the inheriting class.
- Parameters
feature_extraction_function (function) – Custom function that extracts features from text.
-
abstract extract_features(preprocessor, exclude={})
Should add two fields to the preprocessor's instances: feature_vector and feature_names, where feature_vector is a list of numerical values (the feature values) and feature_names is a list of strings (brief descriptions of the features).
- Parameters
preprocessor (BasePreprocessor) – Preprocessor containing samples to featurize.
exclude (Set[str]) – Set of features that should be excluded from the resulting feature vectors.
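The contract described above can be illustrated with a minimal sketch. It does not import the real package: `BaseFeaturizerSketch` is a hypothetical stand-in for `BaseFeaturizer`, `LengthFeaturizer` is an invented subclass, and the preprocessor is assumed to be a plain list of dicts with a 'text' key, which may differ from the actual `BasePreprocessor`.

```python
from abc import ABC, abstractmethod


class BaseFeaturizerSketch(ABC):
    """Hypothetical stand-in mirroring the two abstract methods documented above."""

    @abstractmethod
    def add_feature(self, feature_extraction_function):
        ...

    @abstractmethod
    def extract_features(self, preprocessor, exclude=frozenset()):
        ...


class LengthFeaturizer(BaseFeaturizerSketch):
    """Invented example subclass with a single built-in feature."""

    def __init__(self):
        self._extractors = [self._num_chars]

    @staticmethod
    def _num_chars(instance):
        return {"feature_names": ["num_chars"],
                "feature_vector": [len(instance["text"])]}

    def add_feature(self, feature_extraction_function):
        self._extractors.append(feature_extraction_function)

    def extract_features(self, preprocessor, exclude=frozenset()):
        # Assumption: the preprocessor is iterable and yields mutable dicts.
        for instance in preprocessor:
            names, values = [], []
            for extractor in self._extractors:
                out = extractor(instance)
                for name, value in zip(out["feature_names"], out["feature_vector"]):
                    if name not in exclude:
                        names.append(name)
                        values.append(value)
            # Add both fields to the instance in place.
            instance["feature_names"] = names
            instance["feature_vector"] = values


samples = [{"text": "hello"}]
LengthFeaturizer().extract_features(samples)

excluded = [{"text": "hello"}]
LengthFeaturizer().extract_features(excluded, exclude={"num_chars"})
```

After the first call, `samples[0]` carries `feature_names` and `feature_vector`; in the second call the excluded feature is dropped from both lists, so they stay aligned.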
-
-
class text_classification.featurizer.tweet_featurizer.TweetFeaturizer(lang_model='en_core_web_sm', normalize=True)
Bases: text_classification.featurizer.base.BaseFeaturizer
Featurizer that extracts features from tweets. It contains no paragraph-based features, since these do not apply to tweets.
-
__init__(lang_model='en_core_web_sm', normalize=True)
Creates a TweetFeaturizer instance.
- Parameters
lang_model (str) – A spaCy language model name.
normalize (bool) – Whether to normalize the features by the number of characters or tokens.
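As a rough illustration of what length-based normalization could look like, the sketch below divides a raw count by the number of tokens or characters. The exact scheme used by `TweetFeaturizer` is not documented here, so the helper `normalize_count` and the whitespace tokenization (in place of spaCy tokens) are assumptions.

```python
def normalize_count(count, length):
    """Hypothetical helper: scale a raw count by a length, guarding against zero."""
    return count / length if length else 0.0


text = "so good !!"
tokens = text.split()          # assumption: whitespace tokens stand in for spaCy tokens
raw_count = text.count("!")    # raw feature value

per_token = normalize_count(raw_count, len(tokens))  # count relative to token count
per_char = normalize_count(raw_count, len(text))     # count relative to character count
```

Normalizing this way makes feature values comparable across tweets of different lengths.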
-
add_feature(feature_extraction_function)
Adds a custom feature extraction function to the predefined ones. The function must take as input a dictionary containing the key 'text' and return a dict with the keys 'feature_names' and 'feature_vector'.
- Parameters
feature_extraction_function (function) – Custom function that extracts features from text.
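A custom extraction function satisfying this contract might look like the following sketch. The function name `punctuation_features` and the two feature names are invented for illustration.

```python
def punctuation_features(instance):
    """Hypothetical custom extractor: counts '!' and '?' in the instance text."""
    text = instance["text"]
    return {
        "feature_names": ["num_exclamation_marks", "num_question_marks"],
        "feature_vector": [text.count("!"), text.count("?")],
    }


result = punctuation_features({"text": "Really?! Wow!"})
```

Such a function would then be registered with `featurizer.add_feature(punctuation_features)` on a `TweetFeaturizer` instance. Keeping `feature_names` and `feature_vector` the same length and in the same order lets each value be matched to its description.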
-
extract_features(preprocessor, exclude={})
Extracts the features for all splits in the preprocessor and adds the feature vector and feature names to each instance in place.
- Parameters
preprocessor (BasePreprocessor) – Preprocessor containing samples to featurize.
exclude (Set[str]) – Set of features that should be excluded from the resulting feature vectors.
-