mmcontext.models.mmcontextencoder.MMContextProcessor

class mmcontext.models.mmcontextencoder.MMContextProcessor(text_encoder_name, omics_lookup=None, *, prefix=_PREFIX)

Bases: object

Joint tokenizer for caption-and-omics batches.

The processor can be created in text-only mode; omics processing can be added later via register_initial_embeddings.

Parameters:
  • text_encoder_name (str) – Name or path of a Hugging Face checkpoint whose tokenizer is used for normal captions.

  • omics_lookup (Dict[str, int], optional) – Maps prefixed sample_idx strings to integer indices in the omics embedding matrix (row numbers). If not provided, the processor will only handle text inputs.

  • prefix (str, optional) – Tag that distinguishes omics IDs from text (default: "sample_idx:"). Only used if omics_lookup is provided.

Examples

>>> from mmcontext.models.mmcontextencoder import MMContextProcessor
>>> # Text-only processor
>>> proc = MMContextProcessor("bert-base-uncased")
>>> batch = ["A photo of a cat.", "Another text."]
>>> enc = proc.tokenize(batch)
>>> enc.keys()
dict_keys(['input_ids', 'attention_mask', 'omics_text_info'])
>>> # With omics capabilities (lookup_dict is a Dict[str, int], see omics_lookup above)
>>> proc = MMContextProcessor("bert-base-uncased", lookup_dict)
>>> batch = ["sample_idx:42", "A photo of a cat."]
>>> enc = proc.tokenize(batch)
>>> enc.keys()
dict_keys(['pixel_values', 'input_ids', 'attention_mask', 'omics_text_info'])

tokenize(texts, *, padding=True, **tok_kwargs)

Tokenize a batch of captions and/or omics identifiers.

Return type: dict[str, Tensor]

  • Text-only mode (no lookup table registered) → every element must be a string and is tokenized by the HF tokenizer.

  • Bimodal mode (lookup table present) → elements that start with self.prefix are treated as omics strings. The prefix is stripped, the remainder is split on whitespace, and each resulting token is looked up individually. A sample may therefore contain one or many omics IDs.
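The bimodal rule above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name parse_omics_ids is hypothetical, and the sketch assumes the lookup keys are the bare tokens left after the prefix is stripped, which this page does not confirm.

```python
from typing import Dict, List, Optional

PREFIX = "sample_idx:"  # default prefix named in the parameter list


def parse_omics_ids(
    text: str, lookup: Dict[str, int], prefix: str = PREFIX
) -> Optional[List[int]]:
    """Return embedding-row indices for an omics string, or None for a caption."""
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    if not text.startswith(prefix):
        return None  # plain caption: this would go to the HF tokenizer instead
    tokens = text[len(prefix):].split()  # strip the prefix, split on whitespace
    # Assumption: keys are bare tokens; a missing token raises KeyError.
    return [lookup[tok] for tok in tokens]


lookup = {"42": 0, "7": 1}  # hypothetical lookup table
print(parse_omics_ids("sample_idx:42 7", lookup))
print(parse_omics_ids("A photo of a cat.", lookup))
```

Returning None for captions keeps the routing decision (text tokenizer versus omics lookup) separate from the lookup itself.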

The method always returns the following entries, even when a sample holds only a single omics ID:

    input_ids, attention_mask                    # text part (if any)
    pixel_values           (B, max_omics_len)    # PAD-filled
    omics_attention_mask   (B, max_omics_len)    # 1 = real, 0 = PAD
    omics_text_info        (B,)                  # 1 = text, 0 = omics
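To make the shapes of the omics entries concrete, here is a small sketch that pads per-sample index lists to a common length and builds the two masks. It uses plain lists rather than tensors. The helper name pad_omics, the pad value 0, and the choice of length 1 for an all-text batch are assumptions for illustration; the actual fill value is not stated on this page.

```python
from typing import List, Optional, Tuple

PAD_ID = 0  # hypothetical pad value


def pad_omics(
    rows: List[Optional[List[int]]],
) -> Tuple[List[List[int]], List[List[int]], List[int]]:
    """Pad omics index lists to shape (B, max_omics_len).

    rows[i] is a list of row indices for an omics sample, or None for text.
    """
    # Longest omics sample in the batch; assume length 1 if the batch is all text.
    max_len = max((len(r) for r in rows if r is not None), default=1)
    pixel_values, omics_attention_mask, omics_text_info = [], [], []
    for r in rows:
        ids = r or []  # text samples contribute only padding
        n_pad = max_len - len(ids)
        pixel_values.append(ids + [PAD_ID] * n_pad)
        omics_attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real, 0 = PAD
        omics_text_info.append(1 if r is None else 0)  # 1 = text, 0 = omics
    return pixel_values, omics_attention_mask, omics_text_info


pv, mask, info = pad_omics([[3, 5], None, [9]])
print(pv, mask, info)
```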

Raises:
  • TypeError – If an element is not a string.

  • KeyError – If an omics token is missing from the lookup table.

update_omics_lookup(omics_lookup, prefix=None)

Update the processor with omics lookup capabilities.

Parameters:
  • omics_lookup (Dict[str, int]) – Maps prefixed sample_idx strings to integer indices.

  • prefix (str, optional) – Tag that distinguishes omics IDs from text. If not provided, the existing prefix will be used.
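The switch from text-only to bimodal mode, and the "keep the existing prefix" rule, can be illustrated with a small stand-in class. TinyProcessor and its is_bimodal property are hypothetical and exist only for this sketch; they are not part of mmcontext, and the real class may store its state differently.

```python
from typing import Dict, Optional


class TinyProcessor:
    """Hypothetical stand-in that mirrors only the lookup/prefix state."""

    def __init__(
        self,
        omics_lookup: Optional[Dict[str, int]] = None,
        prefix: str = "sample_idx:",
    ) -> None:
        self.omics_lookup = omics_lookup
        self.prefix = prefix

    @property
    def is_bimodal(self) -> bool:
        # Bimodal mode is assumed to mean "a lookup table is present".
        return self.omics_lookup is not None

    def update_omics_lookup(
        self, omics_lookup: Dict[str, int], prefix: Optional[str] = None
    ) -> None:
        self.omics_lookup = omics_lookup
        if prefix is not None:  # keep the existing prefix when none is given
            self.prefix = prefix


proc = TinyProcessor()                  # starts in text-only mode
proc.update_omics_lookup({"42": 0})     # now bimodal, prefix unchanged
print(proc.is_bimodal, proc.prefix)
```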

Methods table

tokenize(texts, *[, padding])

Tokenize a batch of captions and/or omics identifiers.

update_omics_lookup(omics_lookup[, prefix])

Update the processor with omics lookup capabilities.
