mmcontext.models.mmcontextencoder.MMContextProcessor
- class mmcontext.models.mmcontextencoder.MMContextProcessor(text_encoder_name, omics_lookup=None, *, prefix=_PREFIX)
Bases: object
Joint tokenizer for caption-and-omics batches.
The processor can initially be created as text-only; omics processing capabilities can be added later via register_initial_embeddings.
- Parameters:
text_encoder_name (str) – Name or path of a Hugging Face checkpoint whose tokenizer is used for normal captions.
omics_lookup (Dict[str, int], optional) – Maps prefixed sample_idx strings to integer indices (row numbers) in the omics embedding matrix. If not provided, the processor handles text inputs only.
prefix (str, optional) – Tag that distinguishes omics IDs from text (default: "sample_idx:"). Only used if omics_lookup is provided.
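A lookup of the shape described for omics_lookup can be built from a list of sample IDs. The following is a minimal sketch, not part of the library: the helper name build_omics_lookup and the sample IDs are hypothetical, the keys carry the prefix because the parameter description says "prefixed sample_idx strings", and starting the row indices at 0 is an assumption.

```python
from typing import Dict, Iterable

PREFIX = "sample_idx:"  # the documented default prefix


def build_omics_lookup(sample_ids: Iterable[str], prefix: str = PREFIX) -> Dict[str, int]:
    """Map prefixed sample IDs to consecutive row indices (hypothetical helper)."""
    lookup: Dict[str, int] = {}
    for sample_id in sample_ids:
        key = f"{prefix}{sample_id}"
        if key in lookup:
            # Two samples must not share a row of the embedding matrix.
            raise ValueError(f"duplicate sample ID: {sample_id!r}")
        lookup[key] = len(lookup)  # assumption: rows are numbered from 0
    return lookup


# Illustrative IDs only; real IDs come from your own dataset.
lookup_dict = build_omics_lookup(["42", "43", "100"])
```

The resulting dictionary is what would be passed as the omics_lookup argument; check the library source for whether any row index is reserved before relying on the numbering used here.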
Examples
>>> # Text-only processor
>>> proc = MMContextProcessor("bert-base-uncased")
>>> batch = ["A photo of a cat.", "Another text."]
>>> enc = proc.tokenize(batch)
>>> enc.keys()
dict_keys(['input_ids', 'attention_mask', 'omics_text_info'])

>>> # With omics capabilities
>>> proc = MMContextProcessor("bert-base-uncased", lookup_dict)
>>> batch = ["sample_idx:42", "A photo of a cat."]
>>> enc = proc.tokenize(batch)
>>> enc.keys()
dict_keys(['pixel_values', 'input_ids', 'attention_mask', 'omics_text_info'])
- tokenize(texts, *, padding=True, **tok_kwargs)
Tokenize a batch of captions and/or omics identifiers.
Text-only mode (no lookup table registered) → every element must be a string and is tokenized by the HF tokenizer.
Bimodal mode (lookup table present) → elements that start with self.prefix are treated as omics strings. The prefix is stripped, the remainder is split on whitespace, and each resulting token is looked up individually. Thus a sample may contain one or many omics IDs.
The method always returns

    input_ids, attention_mask,                 # ⟵ text part (if any)
    pixel_values (B, max_omics_len),           # ⟵ PAD-filled
    omics_attention_mask (B, max_omics_len),   # 1 = real, 0 = PAD
    omics_text_info (B,)                       # 1 = text, 0 = omics

even when a sample holds only a single omics ID.
- Return type: dict[str, Tensor]
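The omics side of the bimodal mode can be pictured in plain Python. This is a sketch under stated assumptions, not the library's code: the function names and PAD_ID = 0 are hypothetical, plain lists stand in for tensors, and the prefix is re-attached before each lookup on the assumption that the lookup keys carry it (the library may resolve keys differently).

```python
from typing import Dict, List, Tuple

PAD_ID = 0  # assumed padding value; the library's actual PAD index may differ


def split_omics_ids(sample: str, lookup: Dict[str, int], prefix: str) -> List[int]:
    """Strip the prefix, split the remainder on whitespace, look up each token."""
    tokens = sample[len(prefix):].split()
    # Assumption: lookup keys include the prefix, so it is re-attached per token.
    return [lookup[f"{prefix}{tok}"] for tok in tokens]


def pad_omics_batch(
    batch: List[str], lookup: Dict[str, int], prefix: str = "sample_idx:"
) -> Tuple[List[List[int]], List[List[int]], List[int]]:
    """Return (pixel_values, omics_attention_mask, omics_text_info) as lists."""
    rows: List[List[int]] = []
    text_info: List[int] = []
    for sample in batch:
        if sample.startswith(prefix):
            rows.append(split_omics_ids(sample, lookup, prefix))
            text_info.append(0)  # 0 = omics
        else:
            rows.append([])      # text samples contribute no omics IDs
            text_info.append(1)  # 1 = text
    max_len = max((len(r) for r in rows), default=0)
    # Right-pad every row to max_len; the mask marks real entries with 1.
    values = [r + [PAD_ID] * (max_len - len(r)) for r in rows]
    mask = [[1] * len(r) + [0] * (max_len - len(r)) for r in rows]
    return values, mask, text_info


lookup = {"sample_idx:42": 7, "sample_idx:43": 8}  # illustrative row numbers
batch = ["sample_idx:42 43", "A photo of a cat.", "sample_idx:43"]
pixel_values, omics_attention_mask, omics_text_info = pad_omics_batch(batch, lookup)
```

Here the first sample holds two omics IDs, so every row is padded to length 2, and the text sample gets an all-PAD row with an all-zero mask. Tokenizing the text samples themselves is left to the Hugging Face tokenizer and is not modeled in this sketch.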
- update_omics_lookup(omics_lookup, prefix=None)
Update the processor with omics lookup capabilities.
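The switch from text-only to bimodal operation can be illustrated with a small stand-in class. This is a toy model, not the library's implementation: ToyProcessor and its attributes are hypothetical, and treating prefix=None as "keep the current prefix" is an assumption based only on the documented signature.

```python
from typing import Dict, Optional


class ToyProcessor:
    """Hypothetical stand-in that mimics only the lookup-update step."""

    def __init__(self, omics_lookup: Optional[Dict[str, int]] = None, *,
                 prefix: str = "sample_idx:") -> None:
        self.lookup = omics_lookup
        self.prefix = prefix

    @property
    def is_bimodal(self) -> bool:
        # Bimodal mode is active once a lookup table is present.
        return self.lookup is not None

    def update_omics_lookup(self, omics_lookup: Dict[str, int],
                            prefix: Optional[str] = None) -> None:
        self.lookup = dict(omics_lookup)
        if prefix is not None:  # assumption: None keeps the existing prefix
            self.prefix = prefix


proc = ToyProcessor()                     # starts in text-only mode
was_text_only = not proc.is_bimodal
proc.update_omics_lookup({"sample_idx:42": 0})
```

With the real class, the analogous call would be update_omics_lookup on an MMContextProcessor created without omics_lookup; its exact side effects should be checked against the library source.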
Methods table
| tokenize(texts, *, padding=True, **tok_kwargs) | Tokenize a batch of captions and/or omics identifiers. |
| update_omics_lookup(omics_lookup, prefix=None) | Update the processor with omics lookup capabilities. |