Preprocessors

class text_classification.preprocessor.base.BasePreprocessor[source]

Bases: abc.ABC

Base class that all preprocessor classes should inherit from to ensure a uniform interface. A class inheriting from BasePreprocessor should define at least three instance variables: train, test and dev, each holding the data split of the same name. Each split is a list of dictionaries, one per instance, with the fields text and label holding that instance's text and label.

# self.train, self.test and self.dev each hold a list of this shape:
self.train = [
    {
        "text": "<instance_1 text>",
        "label": "<instance_1 label>"
    },
    {
        "text": "<instance_2 text>",
        "label": "<instance_2 label>"
    },
    ...
]
abstract classmethod from_file(filename)[source]
abstract get_data()[source]
abstract get_dev_data()[source]
abstract get_test_data()[source]
abstract get_train_data()[source]
abstract write_csv(filename, delimiter)[source]
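
A minimal sketch of a subclass that satisfies this contract. Because the package itself is not imported here, BasePreprocessorSketch is a hypothetical stand-in that only mirrors the abstract methods listed above; InMemoryPreprocessor, its constructor arguments, and the choice to write the test split in write_csv are likewise illustrative assumptions, not part of the library.

```python
import csv
from abc import ABC, abstractmethod


class BasePreprocessorSketch(ABC):
    """Hypothetical stand-in mirroring the abstract interface listed above."""

    @classmethod
    @abstractmethod
    def from_file(cls, filename): ...

    @abstractmethod
    def get_data(self): ...

    @abstractmethod
    def get_dev_data(self): ...

    @abstractmethod
    def get_test_data(self): ...

    @abstractmethod
    def get_train_data(self): ...

    @abstractmethod
    def write_csv(self, filename, delimiter): ...


class InMemoryPreprocessor(BasePreprocessorSketch):
    """Illustrative subclass that keeps the three splits in memory."""

    def __init__(self, train=None, test=None, dev=None):
        # The three instance variables required by the contract.
        self.train = list(train or [])
        self.test = list(test or [])
        self.dev = list(dev or [])

    @classmethod
    def from_file(cls, filename):
        # Assumption: a tab-separated file with a header row naming
        # "text" and "label" columns, loaded entirely into the train split.
        with open(filename, newline="", encoding="utf-8") as handle:
            rows = csv.DictReader(handle, delimiter="\t")
            return cls(train=[{"text": r["text"], "label": r["label"]} for r in rows])

    def get_data(self):
        return self.train, self.test, self.dev

    def get_dev_data(self):
        return self.dev

    def get_test_data(self):
        return self.test

    def get_train_data(self):
        return self.train

    def write_csv(self, filename, delimiter):
        # Assumption: write the test split; keys other than text/label are skipped.
        with open(filename, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(
                handle,
                fieldnames=["text", "label"],
                delimiter=delimiter,
                extrasaction="ignore",
            )
            writer.writeheader()
            writer.writerows(self.test)


preprocessor = InMemoryPreprocessor(
    train=[{"text": "good film", "label": "pos"}],
    test=[{"text": "dull plot", "label": "neg"}],
)
train, test, dev = preprocessor.get_data()
```

Leaving any of the abstract methods unimplemented makes the subclass impossible to instantiate, which is how the base class enforces the uniform interface.
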
class text_classification.preprocessor.csv_preprocessor.CSVPreprocessor(train_filename=None, test_filename=None, dev_filename=None, test_split=0, dev_split=0, delimiter='\t', text_column='text', label_column='label', random_state=None)[source]

Bases: text_classification.preprocessor.base.BasePreprocessor

Preprocessor that reads a csv-file and performs a train/test/dev split. A preprocessor instance serves as a sample store; each stored sample can be extended with feature vectors and predictions.

__init__(train_filename=None, test_filename=None, dev_filename=None, test_split=0, dev_split=0, delimiter='\t', text_column='text', label_column='label', random_state=None)[source]
Parameters
  • train_filename (str) – Train set file.

  • test_filename (str) – Test set file.

  • dev_filename (str) – Dev set file.

  • test_split (float) – Fraction of train set that should be used as test set.

  • dev_split (float) – Fraction of train set that should be used as dev set.

  • delimiter (str) – Delimiter that is used in csv-file.

  • text_column (str) – Column in csv-file containing text.

  • label_column (str) – Column in csv-file containing label.

  • random_state (int) – Random state for shuffling samples.
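
How test_split, dev_split and random_state interact is not spelled out above. The following is a sketch of one plausible reading, in which the samples are shuffled with a seeded generator and the two fractions are carved off the train set. The function split_samples is hypothetical and is not the library's implementation; the rounding rule (truncation) is also an assumption.

```python
import random


def split_samples(samples, test_split=0.0, dev_split=0.0, random_state=None):
    """Shuffle samples and carve test/dev fractions off the train set (illustrative)."""
    if test_split < 0 or dev_split < 0 or test_split + dev_split > 1:
        raise ValueError("split fractions must be non-negative and sum to at most 1")

    shuffled = list(samples)                        # leave the caller's list untouched
    random.Random(random_state).shuffle(shuffled)   # seeded, hence reproducible

    n_test = int(len(shuffled) * test_split)        # assumption: truncate, not round
    n_dev = int(len(shuffled) * dev_split)

    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, test, dev


samples = [{"text": f"text {i}", "label": i % 2} for i in range(10)]
train, test, dev = split_samples(samples, test_split=0.2, dev_split=0.1, random_state=42)
```

With the same random_state the three splits come out identical on every call, which is the usual reason for exposing such a parameter.
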

classmethod from_file(train_filename=None, test_filename=None, dev_filename=None, test_split=0, dev_split=0, delimiter='\t', text_column='text', label_column='label', random_state=None)[source]

Load samples from csv-files.

Parameters
  • train_filename (str) – Train set file.

  • test_filename (str) – Test set file.

  • dev_filename (str) – Dev set file.

  • test_split (float) – Fraction of train set that should be used as test set.

  • dev_split (float) – Fraction of train set that should be used as dev set.

  • delimiter (str) – Delimiter that is used in csv-file.

  • text_column (str) – Column in csv-file containing text.

  • label_column (str) – Column in csv-file containing label.

  • random_state (int) – Random state for shuffling samples.

Returns

CSVPreprocessor instance
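
The delimiter, text_column and label_column parameters describe how a file maps onto the sample format used by the preprocessors. A sketch of that mapping with the standard csv module follows. read_samples is a hypothetical helper, not a library function, and it assumes the file has a header row naming the columns, which the reference above does not state.

```python
import csv
import io


def read_samples(handle, delimiter="\t", text_column="text", label_column="label"):
    """Map each row of a delimited file onto a {"text": ..., "label": ...} dict."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [{"text": row[text_column], "label": row[label_column]} for row in reader]


# An in-memory file stands in for a real path; the column names are made up.
example = io.StringIO("id\tsentence\tclass\n1\tgood film\tpos\n2\tdull plot\tneg\n")
samples = read_samples(example, text_column="sentence", label_column="class")
```

Columns other than the two named ones (here id) are ignored, so the source file may carry extra metadata.
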

get_data()[source]

Returns a tuple containing train, test and dev set.

Returns

Tuple with train, test and dev set.

get_dev_data()[source]

Returns dev set.

Returns

Dev set.

get_test_data()[source]

Returns test set.

Returns

Test set.

Return type

List[dict]

get_train_data()[source]

Returns train set.

Returns

Train set.

write_csv(filename, delimiter='\t', set='test')[source]

Write samples (i.e. text, label, prediction) to a csv-file.

Parameters
  • filename (str) – File to write the samples to.

  • delimiter (str) – Delimiter that is used in csv-file.

  • set (str) – Which samples set to write. Possible values: “train”, “test”, “dev”.
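
A sketch of what writing text, label and prediction columns can look like with the standard csv module. write_samples is a hypothetical helper rather than the library's code, and the key name "prediction" as well as the header row are assumptions; the reference above only says that these three values are written.

```python
import csv
import io


def write_samples(handle, samples, delimiter="\t"):
    """Write one row per sample with text, label and prediction columns."""
    writer = csv.DictWriter(
        handle,
        fieldnames=["text", "label", "prediction"],
        delimiter=delimiter,
        restval="",             # samples without a prediction get an empty cell
        extrasaction="ignore",  # skip any other keys stored on a sample
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(samples)


buffer = io.StringIO()
write_samples(
    buffer,
    [
        {"text": "good film", "label": "pos", "prediction": "pos"},
        {"text": "dull plot", "label": "neg"},
    ],
)
```
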

write_feature_vectors(filename, delimiter='\t', set='train')[source]

Write extracted features to a csv-file.

Parameters
  • filename (str) – File to write the feature vectors to.

  • delimiter (str) – Delimiter that is used in csv-file.

  • set (str) – From which samples set to write the feature vectors. Possible values: “train”, “test”, “dev”.