immuneML.encodings package
Subpackages
- immuneML.encodings.abundance_encoding package
- Submodules
- immuneML.encodings.abundance_encoding.AbundanceEncoderHelper module
- immuneML.encodings.abundance_encoding.CompAIRRBatchIterator module
- immuneML.encodings.abundance_encoding.CompAIRRSequenceAbundanceEncoder module
CompAIRRSequenceAbundanceEncoder
CompAIRRSequenceAbundanceEncoder.LOG_FILENAME
CompAIRRSequenceAbundanceEncoder.OUTPUT_FILENAME
CompAIRRSequenceAbundanceEncoder.RELEVANT_SEQUENCE_ABUNDANCE
CompAIRRSequenceAbundanceEncoder.TOTAL_SEQUENCE_ABUNDANCE
CompAIRRSequenceAbundanceEncoder.build_object()
CompAIRRSequenceAbundanceEncoder.encode()
CompAIRRSequenceAbundanceEncoder.get_additional_files()
CompAIRRSequenceAbundanceEncoder.get_relevant_sequence_attributes()
CompAIRRSequenceAbundanceEncoder.get_sequence_set()
CompAIRRSequenceAbundanceEncoder.load_encoder()
CompAIRRSequenceAbundanceEncoder.set_context()
CompAIRRSequenceAbundanceEncoder.write_sequence_set_file()
- immuneML.encodings.abundance_encoding.KmerAbundanceEncoder module
- immuneML.encodings.abundance_encoding.SequenceAbundanceEncoder module
SequenceAbundanceEncoder
SequenceAbundanceEncoder.RELEVANT_SEQUENCE_ABUNDANCE
SequenceAbundanceEncoder.TOTAL_SEQUENCE_ABUNDANCE
SequenceAbundanceEncoder.build_object()
SequenceAbundanceEncoder.encode()
SequenceAbundanceEncoder.get_additional_files()
SequenceAbundanceEncoder.load_encoder()
SequenceAbundanceEncoder.set_context()
- Module contents
- immuneML.encodings.atchley_kmer_encoding package
- immuneML.encodings.deeprc package
- Submodules
- immuneML.encodings.deeprc.DeepRCEncoder module
DeepRCEncoder
DeepRCEncoder.COUNTS_COLUMN
DeepRCEncoder.EXTENSION
DeepRCEncoder.ID_COLUMN
DeepRCEncoder.METADATA_EXTENSION
DeepRCEncoder.METADATA_SEP
DeepRCEncoder.SEP
DeepRCEncoder.SEQUENCE_COLUMN
DeepRCEncoder.build_object()
DeepRCEncoder.encode()
DeepRCEncoder.export_metadata_file()
DeepRCEncoder.export_repertoire_tsv_files()
DeepRCEncoder.set_context()
- Module contents
- immuneML.encodings.distance_encoding package
- Submodules
- immuneML.encodings.distance_encoding.CompAIRRDistanceEncoder module
CompAIRRDistanceEncoder
CompAIRRDistanceEncoder.INPUT_FILENAME
CompAIRRDistanceEncoder.LOG_FILENAME
CompAIRRDistanceEncoder.OUTPUT_FILENAME
CompAIRRDistanceEncoder.build_distance_matrix()
CompAIRRDistanceEncoder.build_labels()
CompAIRRDistanceEncoder.build_object()
CompAIRRDistanceEncoder.encode()
CompAIRRDistanceEncoder.set_context()
- immuneML.encodings.distance_encoding.DistanceEncoder module
- immuneML.encodings.distance_encoding.DistanceMetricType module
- immuneML.encodings.distance_encoding.TCRdistEncoder module
- Module contents
- immuneML.encodings.evenness_profile package
- immuneML.encodings.kmer_frequency package
- Subpackages
- immuneML.encodings.kmer_frequency.sequence_encoding package
- Submodules
- immuneML.encodings.kmer_frequency.sequence_encoding.GappedKmerSequenceEncoder module
- immuneML.encodings.kmer_frequency.sequence_encoding.IMGTGappedKmerEncoder module
- immuneML.encodings.kmer_frequency.sequence_encoding.IMGTKmerSequenceEncoder module
- immuneML.encodings.kmer_frequency.sequence_encoding.IdentitySequenceEncoder module
- immuneML.encodings.kmer_frequency.sequence_encoding.KmerSequenceEncoder module
- immuneML.encodings.kmer_frequency.sequence_encoding.SequenceEncodingStrategy module
- immuneML.encodings.kmer_frequency.sequence_encoding.SequenceEncodingType module
- Module contents
- Submodules
- immuneML.encodings.kmer_frequency.KmerFreqReceptorEncoder module
- immuneML.encodings.kmer_frequency.KmerFreqRepertoireEncoder module
- immuneML.encodings.kmer_frequency.KmerFreqSequenceEncoder module
- immuneML.encodings.kmer_frequency.KmerFrequencyEncoder module
KmerFrequencyEncoder
KmerFrequencyEncoder.STEP_ENCODED
KmerFrequencyEncoder.STEP_NORMALIZED
KmerFrequencyEncoder.STEP_SCALED
KmerFrequencyEncoder.STEP_VECTORIZED
KmerFrequencyEncoder.build_object()
KmerFrequencyEncoder.dataset_mapping
KmerFrequencyEncoder.encode()
KmerFrequencyEncoder.get_additional_files()
KmerFrequencyEncoder.scale_normalized()
- Module contents
- immuneML.encodings.onehot package
- immuneML.encodings.preprocessing package
- immuneML.encodings.reference_encoding package
- Submodules
- immuneML.encodings.reference_encoding.MatchedReceptorsEncoder module
- immuneML.encodings.reference_encoding.MatchedReferenceUtil module
- immuneML.encodings.reference_encoding.MatchedRegexEncoder module
- immuneML.encodings.reference_encoding.MatchedRegexRepertoireEncoder module
- immuneML.encodings.reference_encoding.MatchedSequencesEncoder module
- immuneML.encodings.reference_encoding.SequenceMatchingSummaryType module
- Module contents
- immuneML.encodings.word2vec package
Submodules
immuneML.encodings.DatasetEncoder module
- class immuneML.encodings.DatasetEncoder.DatasetEncoder(name: str = None)
Bases: object
YAML specification:
    encodings:
        e1: <encoder_class> # encoding without parameters
        e2:
            <encoder_class>: # encoding with parameters
                parameter: value
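Once the YAML is loaded into Python dictionaries, the two forms in the specification above (a bare encoder class name, or a class name mapped to its parameters) can be handled uniformly. The following is a minimal sketch under stated assumptions: `normalise_encoding_spec`, the encoder names `MyEncoderA` and `MyEncoderB`, and the parameter `my_parameter` are hypothetical and only illustrate the structure; this is not immuneML's own parser.

```python
from typing import Any, Dict, Tuple


def normalise_encoding_spec(spec: Any) -> Tuple[str, Dict[str, Any]]:
    """Return (encoder_class_name, params) for either form of an encoding entry.

    Hypothetical helper for illustration only; not part of immuneML.
    """
    if isinstance(spec, str):
        # Form 1: `e1: <encoder_class>` (encoding without parameters)
        return spec, {}
    if isinstance(spec, dict) and len(spec) == 1:
        # Form 2: `e2: {<encoder_class>: {parameter: value}}`
        class_name, params = next(iter(spec.items()))
        return class_name, dict(params or {})
    raise ValueError(f"Unrecognised encoding specification: {spec!r}")


# The structure that the YAML above would have after loading,
# filled in with hypothetical encoder names and parameters.
encodings = {
    "e1": "MyEncoderA",
    "e2": {"MyEncoderB": {"my_parameter": 1}},
}

parsed = {key: normalise_encoding_spec(value) for key, value in encodings.items()}
```

After this step, every entry is a pair of a class name and a parameter dictionary, which is the shape that a build_object(dataset, **params) call expects.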
- abstract static build_object(dataset: Dataset, **params)
Creates an instance of the relevant subclass of the DatasetEncoder class using the given parameters. This method is called at parsing time (early in the immuneML run), so that parameters and dataset type can be checked here.
The build_object method should do the following:
1. Check parameters: immuneML should crash if the user specifies invalid parameters. The ParameterValidator utility class may be used for parameter testing.
2. Check the dataset type: immuneML should crash if the wrong dataset type is specified for this encoder. For example, DeepRCEncoder should only work for RepertoireDatasets and crash if the dataset is of another type.
3. Create an instance of the correct Encoder class using the given parameters, and return this object. Some encoders have different subclasses depending on the dataset type, so make sure to return an instance of the correct subclass. For instance, KmerFrequencyEncoder has a different subclass for each dataset type: when the dataset is a RepertoireDataset, KmerFreqRepertoireEncoder should be returned.
- Parameters:
**params – keyword arguments used to create the Encoder object. They are provided by the user in the specification (if immuneML is used as a command line tool) or in a dictionary (when the method is called from code).
- Returns:
the object of the appropriate Encoder class
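The three responsibilities of build_object described above can be sketched as follows. This is an illustration under stated assumptions, not immuneML code: `Dataset`, `RepertoireDataset` and `SequenceDataset` are small stand-in classes defined here so that the example runs on its own, and `MyEncoder` with its parameter `k` is hypothetical. A real encoder would typically use the ParameterValidator utility class for the parameter checks.

```python
class Dataset:
    """Stand-in for a dataset base class (illustration only)."""


class RepertoireDataset(Dataset):
    """Stand-in for a repertoire dataset."""


class SequenceDataset(Dataset):
    """Stand-in for a sequence dataset."""


class MyEncoder:
    """Hypothetical encoder that only accepts repertoire datasets."""

    def __init__(self, k: int, name: str = None):
        self.k = k
        self.name = name

    @staticmethod
    def build_object(dataset: Dataset, **params):
        # 1. Check parameters: fail early on invalid user input.
        k = params.get("k")
        if not isinstance(k, int) or k < 1:
            raise ValueError(f"MyEncoder: 'k' must be a positive integer, got {k!r}.")

        # 2. Check the dataset type: fail if this encoder cannot handle it.
        if not isinstance(dataset, RepertoireDataset):
            raise TypeError(
                f"MyEncoder: expected a RepertoireDataset, got {type(dataset).__name__}."
            )

        # 3. Create and return an instance of the correct encoder class.
        return MyEncoder(**params)


encoder = MyEncoder.build_object(RepertoireDataset(), k=3, name="e1")
```

Because both checks run before any encoding work starts, a misconfigured specification fails at parsing time rather than partway through a long analysis.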
- abstract encode(dataset, params: EncoderParams) -> Dataset
This is the main encoding method of the Encoder. It takes a dataset, computes an EncodedData object, and returns a copy of the dataset with the EncodedData object attached.
- Parameters:
dataset – A dataset object (SequenceDataset, ReceptorDataset or RepertoireDataset)
params – An EncoderParams object containing a few utility parameters which may be used during encoding (e.g., the number of parallel processes to use).
- Returns:
A copy of the original dataset, with an EncodedData object added to the dataset.encoded_data field.
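The contract described above (compute the encoded data, attach it to a copy, leave the input dataset untouched) can be sketched with stand-in classes. `ToyDataset`, the simplified `EncodedData` and `LengthEncoder` below are hypothetical and defined only for this example; the real immuneML classes carry more fields.

```python
import copy
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EncodedData:
    """Simplified stand-in: a feature matrix plus feature names."""
    examples: List[List[float]]
    feature_names: List[str]


@dataclass
class ToyDataset:
    """Stand-in dataset holding raw sequences and, optionally, encoded data."""
    sequences: List[str]
    encoded_data: Optional[EncodedData] = None


class LengthEncoder:
    """Hypothetical encoder with one feature per example: the sequence length."""

    def encode(self, dataset: ToyDataset, params=None) -> ToyDataset:
        encoded = EncodedData(
            examples=[[float(len(seq))] for seq in dataset.sequences],
            feature_names=["sequence_length"],
        )
        # Attach the result to a copy, so the input dataset stays unencoded.
        encoded_dataset = copy.copy(dataset)
        encoded_dataset.encoded_data = encoded
        return encoded_dataset


original = ToyDataset(sequences=["CASSL", "CASSLGQ"])
result = LengthEncoder().encode(original)
```

Returning a copy means the same input dataset can be passed to several encoders without one encoding overwriting another.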
- static get_additional_files() -> List[str]
Should return a list of all the files that need to be stored when storing the encoder. For example, SimilarToPositiveSequenceEncoder stores all ‘positive’ sequences in the training data, and predicts a sequence to be ‘positive’ if it is similar to any positive sequence in the training data. In that case, these positive sequences are stored in a file.
For many encoders, it may not be necessary to store additional files.
- static load_attribute(encoder, encoder_file: Path, attribute: str)
Utility method for loading correct file paths when loading an encoder (see: load_encoder). This method should not be overwritten.
- static load_encoder(encoder_file: Path)
Loads an encoder from the file where an encoder of the same class was previously stored using the store_encoder function. Encoders are stored in pickle format. If the encoder uses additional files, they should be explicitly loaded here as well.
If there are no additional files, this method does not need to be overwritten. If there are additional files, its contents should be as follows:
    encoder = DatasetEncoder.load_encoder(encoder_file)
    encoder.my_additional_file = DatasetEncoder.load_attribute(encoder, encoder_file, "my_additional_file")
- Parameters:
encoder_file (Path) – path to the encoder file where the encoder was stored using the store_encoder() function
- Returns:
the loaded Encoder object
- set_context(context: dict)
This method can be used to attach the full dataset (as part of a dictionary) to the encoder, as opposed to the dataset that is passed to the .encode() method. When training ML models, the dataset passed to .encode() is usually a training/validation subset (data split) of the full dataset.
In most cases, an encoder should only use the ‘dataset’ argument passed to the .encode() method to compute the encoded data. Using information from the full dataset, which includes the test data, may result in data leakage. For example, some encoders normalise the computed feature values (e.g., KmerFrequencyEncoder). Such normalised feature values should be based only on the current data split, and test data should remain unseen.
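The leakage concern above can be made concrete with a generic standardisation example. This is a sketch of the general principle only: `fit_scaler` and `apply_scaler` are hypothetical helpers, and the scaling shown is plain mean/standard-deviation standardisation, which is not necessarily the normalisation that KmerFrequencyEncoder applies.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_scaler(train_values: List[float]) -> Tuple[float, float]:
    """Learn mean and standard deviation from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # fall back to 1.0 for constant features
    return mu, sigma


def apply_scaler(values: List[float], scaler: Tuple[float, float]) -> List[float]:
    """Scale values with previously learned statistics, without refitting."""
    mu, sigma = scaler
    return [(value - mu) / sigma for value in values]


train_counts = [2.0, 4.0, 6.0]
test_counts = [8.0]

# The statistics come from the training split; the test split is only transformed.
scaler = fit_scaler(train_counts)
train_scaled = apply_scaler(train_counts, scaler)
test_scaled = apply_scaler(test_counts, scaler)
```

Had the scaler been fitted on the training and test values together, the mean and standard deviation would already contain information about the test example.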
To avoid confusion about which version of the dataset to use, the full dataset is by default not attached, and attaching the full dataset should be done explicitly when required. For instance, if the encoded data is some kind of distance matrix (e.g., DistanceEncoder), the distance between examples in the training and test dataset should be included. Note that this does not entail data leakage: the test examples are not used to improve the computation of distances. The distances to test examples are determined by an algorithm which does not ‘learn’ from test data.
To explicitly enable using the full dataset in the encoder, the contents of this method should be as follows:
    self.context = context
    return self
- Parameters:
context – a dictionary containing the full dataset
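As a sketch of the distance-matrix case described above, the example below opts in to the full dataset through set_context and then computes distances from each example in the current split to every example in the full dataset. `ToyDistanceEncoder` and `hamming` are hypothetical and are not immuneML's DistanceEncoder; the dictionary key "dataset" and the use of plain lists of strings as datasets are also assumptions made for illustration.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class ToyDistanceEncoder:
    """Hypothetical encoder that needs the full dataset to build its features."""

    def __init__(self):
        self.context = None

    def set_context(self, context: dict):
        # Explicitly opt in to seeing the full dataset.
        self.context = context
        return self

    def encode(self, dataset: List[str]) -> List[List[int]]:
        # Rows: examples in the current split. Columns: all examples in the
        # full dataset, so distances to test examples are available as well.
        full_dataset = self.context["dataset"]
        return [[hamming(a, b) for b in full_dataset] for a in dataset]


full = ["CASS", "CAST", "CGGT"]
train = full[:2]

matrix = ToyDistanceEncoder().set_context({"dataset": full}).encode(train)
```

The distance function is fixed and has no fitted parameters, so including test examples as columns does not let the encoder learn anything from them.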
- store(encoded_dataset, params: EncoderParams)
Stores the given encoded dataset. This method should not be overwritten.
- static store_encoder(encoder, encoder_file: Path)
The store_encoder function stores the given encoder so that it can be loaded later using the load_encoder function. It uses pickle to store the Python object, as well as the additional filenames, which should be returned by the get_additional_files() method.
This method should not be overwritten.
- Parameters:
encoder – the encoder object
encoder_file (Path) – path to the encoder file
- Returns:
the encoder file
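The store and load cycle described above can be sketched with the standard pickle module. Everything below is a stand-in written for illustration, not immuneML's implementation: the three functions only mimic the roles of store_encoder, load_attribute and load_encoder, the encoder is represented by a plain SimpleNamespace, and the rule that an additional file is looked up next to the encoder file is an assumption.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace


def store_encoder(encoder, encoder_file: Path) -> Path:
    # Stand-in: pickle the encoder object and return the file it was written to.
    with encoder_file.open("wb") as file:
        pickle.dump(encoder, file)
    return encoder_file


def load_attribute(encoder, encoder_file: Path, attribute: str) -> Path:
    # Assumption: the additional file sits next to the encoder file, so only
    # its file name is reused and the folder is taken from encoder_file.
    path = encoder_file.parent / Path(getattr(encoder, attribute)).name
    if not path.is_file():
        raise FileNotFoundError(path)
    return path


def load_encoder(encoder_file: Path):
    # Stand-in: unpickle the encoder, then explicitly reload its additional file.
    with encoder_file.open("rb") as file:
        encoder = pickle.load(file)
    encoder.my_additional_file = load_attribute(encoder, encoder_file, "my_additional_file")
    return encoder


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Additional file that the hypothetical encoder depends on.
    sequences_file = folder / "positive_sequences.txt"
    sequences_file.write_text("CASSL\nCASSLGQ\n")

    encoder = SimpleNamespace(name="e1", my_additional_file=sequences_file)
    stored_at = store_encoder(encoder, folder / "encoder.pickle")

    loaded = load_encoder(stored_at)
    loaded_sequences = loaded.my_additional_file.read_text().split()
```

Resolving the additional file relative to the encoder file, rather than trusting the pickled absolute path, keeps a stored encoder usable after its folder has been moved or copied.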
immuneML.encodings.EncoderParams module
- class immuneML.encodings.EncoderParams.EncoderParams(result_path: pathlib.Path = None, label_config: immuneML.environment.LabelConfiguration.LabelConfiguration = None, pool_size: int = 4, model: dict = None, learn_model: bool = True, encode_labels: bool = True, sequence_type: immuneML.environment.SequenceType.SequenceType = <SequenceType.AMINO_ACID: 'sequence_aa'>, region_type: immuneML.data_model.SequenceParams.RegionType = <RegionType.IMGT_CDR3: 'cdr3'>)
Bases: object
- encode_labels: bool = True
- label_config: LabelConfiguration = None
- learn_model: bool = True
- model: dict = None
- pool_size: int = 4
- region_type: RegionType = 'cdr3'
- result_path: Path = None
- sequence_type: SequenceType = 'sequence_aa'