mmcontext.models.mmcontextencoder.MMContextProcessor

class mmcontext.models.mmcontextencoder.MMContextProcessor(text_encoder_name, omics_lookup=None, *, prefix=_PREFIX)

Bases: object

Joint tokenizer for caption-and-omics batches.

The processor can be created in text-only mode first; omics processing capabilities can be added later via register_initial_embeddings.

Parameters:

text_encoder_name (str) – Name or path of a Hugging Face checkpoint whose tokenizer is used for ordinary text captions.
omics_lookup (Dict[str, int], optional) – Maps prefixed sample_idx strings to integer indices in the omics embedding matrix (row numbers). If not provided, the processor will only handle text inputs.
prefix (str, optional) – Tag that distinguishes omics IDs from text (default: "sample_idx:"). Only used if omics_lookup is provided.

Examples

>>> # Text-only processor
>>> from mmcontext.models.mmcontextencoder import MMContextProcessor
>>> proc = MMContextProcessor("bert-base-uncased")
>>> batch = ["A photo of a cat.", "Another text."]
>>> enc = proc.tokenize(batch)
>>> enc.keys()
dict_keys(['input_ids', 'attention_mask', 'omics_text_info'])

>>> # With omics capabilities
>>> # (lookup_dict is a Dict[str, int] as described under omics_lookup)
>>> proc = MMContextProcessor("bert-base-uncased", lookup_dict)
>>> batch = ["sample_idx:42", "A photo of a cat."]
>>> enc = proc.tokenize(batch)
>>> enc.keys()
dict_keys(['pixel_values', 'input_ids', 'attention_mask', 'omics_text_info'])

tokenize(texts, *, padding=True, **tok_kwargs)

Tokenize a batch of captions and/or omics identifiers.

Return type: dict[str, Tensor]

Text-only mode (no lookup table registered) → every element must be a string and is tokenized by the Hugging Face tokenizer.
Bimodal mode (lookup table present) → elements that start with self.prefix are treated as omics strings. The prefix is stripped, the remainder is split on whitespace, and each resulting token is looked up individually, so a sample may contain one or many omics IDs.

The method always returns the following entries, even when a sample holds only a single omics ID:

input_ids, attention_mask – text part (if any)
pixel_values (B, max_omics_len) – PAD-filled
omics_attention_mask (B, max_omics_len) – 1 = real, 0 = PAD
omics_text_info (B,) – 1 = text, 0 = omics

Raises:

TypeError – If an element is not a string.
KeyError – If an omics token is missing from the lookup table.
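
The bimodal branch described for tokenize can be illustrated with plain Python lists. This is a minimal sketch, not the library's implementation: the function name split_and_lookup, the padding value PAD = -1, the choice to fill text rows entirely with PAD, and the assumption that the lookup is keyed by the bare ID left after the prefix is stripped are all hypothetical. The real method returns tensors and also runs the Hugging Face tokenizer on the text elements, which is omitted here.

```python
from typing import Dict, List

PAD = -1  # hypothetical padding value; the real one is not stated in the docs


def split_and_lookup(
    texts: List[str],
    lookup: Dict[str, int],
    prefix: str = "sample_idx:",
) -> Dict[str, list]:
    """Illustrative stand-in for the omics side of tokenize (lists, not tensors)."""
    rows: List[List[int]] = []
    omics_text_info: List[int] = []
    for item in texts:
        if not isinstance(item, str):
            raise TypeError(f"expected str, got {type(item).__name__}")
        if item.startswith(prefix):
            # Strip the prefix, split on whitespace, look up each ID.
            # A missing ID raises KeyError, matching the documented behaviour.
            rows.append([lookup[tok] for tok in item[len(prefix):].split()])
            omics_text_info.append(0)  # 0 = omics
        else:
            rows.append([])  # assumption: text rows carry no omics IDs
            omics_text_info.append(1)  # 1 = text
    max_len = max((len(r) for r in rows), default=0)
    return {
        "pixel_values": [r + [PAD] * (max_len - len(r)) for r in rows],
        "omics_attention_mask": [
            [1] * len(r) + [0] * (max_len - len(r)) for r in rows
        ],
        "omics_text_info": omics_text_info,
    }


batch = ["sample_idx:42 43", "A photo of a cat.", "sample_idx:42"]
out = split_and_lookup(batch, {"42": 7, "43": 8})
print(out["pixel_values"])          # [[7, 8], [-1, -1], [7, -1]]
print(out["omics_attention_mask"])  # [[1, 1], [0, 0], [1, 0]]
print(out["omics_text_info"])       # [0, 1, 0]
```

The sample holding a single ID is still padded to max_omics_len, which is why the mask is needed to tell real entries from PAD.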

update_omics_lookup(omics_lookup, prefix=None)

Update the processor with omics lookup capabilities.

Parameters:

omics_lookup (Dict[str, int]) – Maps prefixed sample_idx strings to integer indices.
prefix (str, optional) – Tag that distinguishes omics IDs from text. If not provided, the existing prefix will be used.
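
The documented prefix fallback (keep the existing prefix unless a new one is passed) can be sketched with a small stand-in class. LookupState and its bimodal property are hypothetical names introduced only for illustration and do not exist in mmcontext; the dictionary keys are placeholders. Only the method name and its two parameters are taken from the signature above.

```python
from typing import Dict, Optional


class LookupState:
    """Hypothetical stand-in holding only the lookup-related state."""

    def __init__(
        self,
        omics_lookup: Optional[Dict[str, int]] = None,
        prefix: str = "sample_idx:",
    ) -> None:
        self.omics_lookup = omics_lookup
        self.prefix = prefix

    @property
    def bimodal(self) -> bool:
        # Text-only until a lookup table has been registered.
        return self.omics_lookup is not None

    def update_omics_lookup(
        self, omics_lookup: Dict[str, int], prefix: Optional[str] = None
    ) -> None:
        self.omics_lookup = omics_lookup
        if prefix is not None:  # otherwise the existing prefix is kept
            self.prefix = prefix


state = LookupState()                         # starts in text-only mode
state.update_omics_lookup({"42": 0})          # prefix stays "sample_idx:"
state.update_omics_lookup({"42": 0}, "omics:")  # prefix is replaced
print(state.bimodal, state.prefix)            # True omics:
```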

Methods table

`tokenize`(texts, *[, padding])	Tokenize a batch of captions and/or omics identifiers.
`update_omics_lookup`(omics_lookup[, prefix])	Update the processor with omics lookup capabilities.
