Preprocessors

class text_classification.preprocessor.base.BasePreprocessor

Bases: abc.ABC

Base class that all preprocessor classes should inherit from to ensure a uniform interface. Classes inheriting from BasePreprocessor should define at least three instance variables: train, test and dev. Each variable holds the data split of the same name. Each data split is a list of dictionaries, where each dictionary represents one instance and contains the fields text and label, holding that instance's text and label respectively.

# Shape shared by self.train, self.test and self.dev (each is its own list):
[
    {
        "text": "<instance_1 text>",
        "label": "<instance_1 label>"
    },
    {
        "text": "<instance_2 text>",
        "label": "<instance_2 label>"
    },
    ...
]
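As an illustration of the contract described above, here is a minimal sketch of a subclass. It does not import the library: BasePreprocessorSketch is a stand-in that only mirrors the abstract methods listed in this reference, and InMemoryPreprocessor, its constructor arguments and the tab-separated file layout with text and label header columns are all hypothetical.

```python
import csv
from abc import ABC, abstractmethod


class BasePreprocessorSketch(ABC):
    """Stand-in mirroring the documented abstract methods (not the library class)."""

    @classmethod
    @abstractmethod
    def from_file(cls, filename): ...

    @abstractmethod
    def get_data(self): ...

    @abstractmethod
    def get_train_data(self): ...

    @abstractmethod
    def get_test_data(self): ...

    @abstractmethod
    def get_dev_data(self): ...

    @abstractmethod
    def write_csv(self, filename, delimiter): ...


class InMemoryPreprocessor(BasePreprocessorSketch):
    """Hypothetical subclass that keeps the three splits as lists of dicts."""

    def __init__(self, train, test, dev):
        self.train, self.test, self.dev = train, test, dev

    @classmethod
    def from_file(cls, filename):
        # Assumption: tab-separated file with "text" and "label" header columns;
        # every row goes into the train split.
        with open(filename, newline="", encoding="utf-8") as handle:
            rows = [{"text": row["text"], "label": row["label"]}
                    for row in csv.DictReader(handle, delimiter="\t")]
        return cls(train=rows, test=[], dev=[])

    def get_data(self):
        return self.train, self.test, self.dev

    def get_train_data(self):
        return self.train

    def get_test_data(self):
        return self.test

    def get_dev_data(self):
        return self.dev

    def write_csv(self, filename, delimiter="\t"):
        # Writes the test split only, to keep the sketch short.
        with open(filename, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=["text", "label"],
                                    delimiter=delimiter)
            writer.writeheader()
            writer.writerows(self.test)
```

Because the stand-in base class declares every method abstract, instantiating a subclass that omits one of them raises TypeError, which is how an abstract base class can enforce a uniform interface.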

abstract classmethod from_file(filename)

abstract get_data()

abstract get_dev_data()

abstract get_test_data()

abstract get_train_data()

abstract write_csv(filename, delimiter)

class text_classification.preprocessor.csv_preprocessor.CSVPreprocessor(train_filename=None, test_filename=None, dev_filename=None, test_split=0, dev_split=0, delimiter='\t', text_column='text', label_column='label', random_state=None)

Bases: text_classification.preprocessor.base.BasePreprocessor

Preprocessor that reads CSV files and splits the data into train, test and dev sets. A preprocessor instance stores the samples, each of which can later be extended with feature vectors and predictions.

__init__(train_filename=None, test_filename=None, dev_filename=None, test_split=0, dev_split=0, delimiter='\t', text_column='text', label_column='label', random_state=None)

Parameters

train_filename (str) – Train set file.
test_filename (str) – Test set file.
dev_filename (str) – Dev set file.
test_split (float) – Fraction of train set that should be used as test set.
dev_split (float) – Fraction of train set that should be used as dev set.
delimiter (str) – Delimiter that is used in csv-file.
text_column (str) – Column in csv-file containing text.
label_column (str) – Column in csv-file containing label.
random_state (int) – Random state for shuffling samples.

classmethod from_file(train_filename=None, test_filename=None, dev_filename=None, test_split=0, dev_split=0, delimiter='\t', text_column='text', label_column='label', random_state=None)

Load samples from csv-files.

Parameters

train_filename (str) – Train set file.
test_filename (str) – Test set file.
dev_filename (str) – Dev set file.
test_split (float) – Fraction of train set that should be used as test set.
dev_split (float) – Fraction of train set that should be used as dev set.
delimiter (str) – Delimiter that is used in csv-file.
text_column (str) – Column in csv-file containing text.
label_column (str) – Column in csv-file containing label.
random_state (int) – Random state for shuffling samples.

Returns

CSVPreprocessor instance
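The parameters above can be read as describing the following kind of procedure. This is a sketch under stated assumptions, not the library's implementation: read_samples and split_samples are hypothetical helpers, and the order in which the test and dev fractions are taken from the shuffled train set is a guess.

```python
import csv
import io
import random


def read_samples(handle, delimiter="\t", text_column="text", label_column="label"):
    """Read rows into the documented {"text": ..., "label": ...} shape."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [{"text": row[text_column], "label": row[label_column]} for row in reader]


def split_samples(samples, test_split=0, dev_split=0, random_state=None):
    """Shuffle the samples, then carve the test and dev fractions off the train set."""
    if test_split + dev_split >= 1:
        raise ValueError("test_split + dev_split must be below 1")
    shuffled = list(samples)
    # A seeded generator makes the shuffle reproducible for a given random_state.
    random.Random(random_state).shuffle(shuffled)
    n_test = int(len(shuffled) * test_split)
    n_dev = int(len(shuffled) * dev_split)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, test, dev


# Ten tab-separated rows held in memory instead of a file on disk.
raw = "text\tlabel\n" + "".join(f"sample {i}\t{i % 2}\n" for i in range(10))
samples = read_samples(io.StringIO(raw))
train, test, dev = split_samples(samples, test_split=0.2, dev_split=0.1, random_state=42)
print(len(train), len(test), len(dev))  # 7 2 1
```

Passing the same random_state yields the same split on every run, which is the usual reason for exposing such a parameter.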

get_data()

Returns a tuple containing train, test and dev set.

Returns: Tuple with train, test and dev set.

get_dev_data()

Returns dev set.

Returns: Dev set.

get_test_data()

Returns test set.

Returns: Test set.
Return type: List[dict]

get_train_data()

Returns train set.

Returns: Train set.

write_csv(filename, delimiter='\t', set='test')

Write samples (i.e. text, label, prediction) to a csv-file.

Parameters

filename (str) – File to write the samples to.
delimiter (str) – Delimiter that is used in csv-file.
set (str) – Which sample set to write. Possible values: “train”, “test”, “dev”.
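To make the output format concrete, the sketch below writes text, label and prediction columns with the standard csv module. It is an assumption-laden illustration rather than the library's code: write_samples is a hypothetical helper, and the prediction and features keys on each sample are guesses at how predictions and feature vectors might be stored.

```python
import csv
import io


def write_samples(handle, samples, delimiter="\t"):
    """Write text, label and prediction for each sample, one row per sample."""
    writer = csv.DictWriter(
        handle,
        fieldnames=["text", "label", "prediction"],
        delimiter=delimiter,
        restval="",             # samples without a prediction get an empty cell
        extrasaction="ignore",  # skip other keys, such as feature vectors
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(samples)


test_set = [
    {"text": "great movie", "label": "pos", "prediction": "pos"},
    {"text": "too long", "label": "neg", "prediction": "pos", "features": [0.1, 0.9]},
]
buffer = io.StringIO()
write_samples(buffer, test_set)
print(buffer.getvalue())
```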

write_feature_vectors(filename, delimiter='\t', set='train')

Write extracted features to a csv-file.

Parameters

filename (str) – File to write the feature vectors to.
delimiter (str) – Delimiter that is used in csv-file.
set (str) – Sample set whose feature vectors are written. Possible values: “train”, “test”, “dev”.