Preprocessors
class text_classification.preprocessor.base.BasePreprocessor
Bases: abc.ABC
Base class that all preprocessor classes should inherit from to ensure uniformity. Classes inheriting from BasePreprocessor should define at least three instance variables: train, test and dev. Each of these variables should contain the data split corresponding to its name. Each data split should be a list of dictionaries, where each dictionary represents one instance and contains the fields text and label, holding the instance's corresponding values.

    self.train, self.test, self.dev = [
        { "text": instance_1 text, "label": instance_1 label },
        { "text": instance_2 text, "label": instance_2 label },
        ...
    ]
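As a minimal sketch of the contract described above, the following hypothetical subclass builds the three splits from in-memory (text, label) pairs. The BasePreprocessor stand-in and the InMemoryPreprocessor name are assumptions made so that the example runs on its own; in real code the base class would be imported from text_classification.preprocessor.base.

```python
from abc import ABC


class BasePreprocessor(ABC):
    """Stand-in (assumption) for text_classification.preprocessor.base.BasePreprocessor."""


class InMemoryPreprocessor(BasePreprocessor):
    """Hypothetical preprocessor that wraps (text, label) pairs."""

    def __init__(self, train, test=(), dev=()):
        # Each split is a list of {"text": ..., "label": ...} dictionaries.
        self.train = self._to_instances(train)
        self.test = self._to_instances(test)
        self.dev = self._to_instances(dev)

    @staticmethod
    def _to_instances(pairs):
        return [{"text": text, "label": label} for text, label in pairs]


preprocessor = InMemoryPreprocessor(
    train=[("great movie", "pos"), ("dull plot", "neg")],
    test=[("loved it", "pos")],
)
```

A split that receives no data ends up as an empty list, so all three variables always exist.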
class text_classification.preprocessor.csv_preprocessor.CSVPreprocessor(train_filename=None, test_filename=None, dev_filename=None, test_split=0, dev_split=0, delimiter='\t', text_column='text', label_column='label', random_state=None)
Bases: text_classification.preprocessor.base.BasePreprocessor
Preprocessor that reads a csv-file and performs the train/test/dev split. A preprocessor instance serves as sample storage; the stored samples can be extended with feature vectors and predictions.
__init__(train_filename=None, test_filename=None, dev_filename=None, test_split=0, dev_split=0, delimiter='\t', text_column='text', label_column='label', random_state=None)
Parameters
train_filename (str) – Train set file.
test_filename (str) – Test set file.
dev_filename (str) – Dev set file.
test_split (float) – Fraction of train set that should be used as test set.
dev_split (float) – Fraction of train set that should be used as dev set.
delimiter (str) – Delimiter that is used in csv-file.
text_column (str) – Column in csv-file containing text.
label_column (str) – Column in csv-file containing label.
random_state (int) – Random state for shuffling samples.
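The parameter descriptions do not spell out how test_split and dev_split are applied. A minimal sketch, assuming both fractions are taken from the shuffled train set and that random_state seeds the shuffle; split_samples is a hypothetical helper, not part of the library:

```python
import random


def split_samples(samples, test_split=0, dev_split=0, random_state=None):
    """Shuffle samples and carve test and dev sets out of the train set."""
    if test_split + dev_split >= 1:
        raise ValueError("test_split + dev_split must be smaller than 1")
    shuffled = list(samples)
    # A seeded generator makes the shuffle reproducible.
    random.Random(random_state).shuffle(shuffled)
    n_test = int(len(shuffled) * test_split)
    n_dev = int(len(shuffled) * dev_split)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, test, dev


samples = [{"text": f"text {i}", "label": i % 2} for i in range(10)]
train, test, dev = split_samples(
    samples, test_split=0.2, dev_split=0.1, random_state=42
)
```

With ten samples, this leaves seven in train, two in test and one in dev, and the same random_state yields the same split on every run.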
classmethod from_file(train_filename=None, test_filename=None, dev_filename=None, test_split=0, dev_split=0, delimiter='\t', text_column='text', label_column='label', random_state=None)
Load samples from csv-files.
Parameters
train_filename (str) – Train set file.
test_filename (str) – Test set file.
dev_filename (str) – Dev set file.
test_split (float) – Fraction of train set that should be used as test set.
dev_split (float) – Fraction of train set that should be used as dev set.
delimiter (str) – Delimiter that is used in csv-file.
text_column (str) – Column in csv-file containing text.
label_column (str) – Column in csv-file containing label.
random_state (int) – Random state for shuffling samples.
Returns
CSVPreprocessor instance
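To illustrate the file layout these parameters describe, here is a minimal sketch that reads a delimited file into the list-of-dictionaries form using only the standard library. load_samples is a hypothetical helper, the header row is an assumption, and the in-memory file stands in for a real train set file; the library's own reader may behave differently.

```python
import csv
import io


def load_samples(file, delimiter="\t", text_column="text", label_column="label"):
    """Read a delimited file with a header row into text/label dictionaries."""
    reader = csv.DictReader(file, delimiter=delimiter)
    return [
        {"text": row[text_column], "label": row[label_column]}
        for row in reader
    ]


# Tab-separated content with a header row naming the columns (assumed layout).
tsv = io.StringIO("text\tlabel\ngreat movie\tpos\ndull plot\tneg\n")
samples = load_samples(tsv)
```

Passing other values for text_column and label_column selects differently named columns from the same kind of file.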
get_data()
Returns a tuple containing the train, test and dev sets.
Returns
Tuple with the train, test and dev sets.
write_csv(filename, delimiter='\t', set='test')
Write samples (i.e. text, label, prediction) to a csv-file.
Parameters
filename (str) – File to write the samples to.
delimiter (str) – Delimiter that is used in csv-file.
set (str) – Which sample set to write. Possible values: “train”, “test”, “dev”.
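The exact output layout of write_csv is not documented here. A minimal sketch of writing text, label and prediction as delimited rows, assuming a header row and a "prediction" key on each sample; write_samples is a hypothetical helper, not the library's method:

```python
import csv
import io


def write_samples(file, samples, delimiter="\t"):
    """Write text, label and prediction of each sample as one delimited row."""
    writer = csv.DictWriter(
        file,
        fieldnames=["text", "label", "prediction"],
        delimiter=delimiter,
        lineterminator="\n",
        extrasaction="ignore",  # skip extra keys such as feature vectors
    )
    writer.writeheader()
    for sample in samples:
        writer.writerow(sample)


test_samples = [
    {"text": "loved it", "label": "pos", "prediction": "pos"},
    {"text": "too long", "label": "neg", "prediction": "pos"},
]
buffer = io.StringIO()
write_samples(buffer, test_samples)
output = buffer.getvalue()
```

Writing to an in-memory buffer keeps the example self-contained; an opened file object (with newline="") can be passed in the same way.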
write_feature_vectors(filename, delimiter='\t', set='train')
Write extracted features to a csv-file.
Parameters
filename (str) – File to write the feature vectors to.
delimiter (str) – Delimiter that is used in csv-file.
set (str) – Which sample set to write the feature vectors from. Possible values: “train”, “test”, “dev”.