prxteinmpnn.io

Utilities for processing structure and trajectory files.

prxteinmpnn.io._check_if_file_empty(file_path)

Check if the file is empty.

Return type:

bool

Parameters:

file_path (str)

prxteinmpnn.io.string_key_to_index(string_keys, key_map, unk_index=None)

Convert string keys to integer indices based on a mapping.

This is an efficient vectorized implementation that converts a 1D array of string keys to a 1D array of integer indices using the provided mapping. Any key not found in the mapping is replaced with the specified unknown index.

Parameters:
  • string_keys (ndarray) – A 1D array of string keys.

  • key_map (Mapping[str, int]) – A dictionary mapping string keys to integer indices.

  • unk_index (int | None) – The index to use for unknown keys not found in the mapping. If None, uses the length of the key_map as the unknown index.

Return type:

Array

Returns:

A 1D array of integer indices corresponding to the string keys.
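
The lookup described above can be sketched with NumPy as follows. This is a minimal illustration, assuming a sorted-key binary search; the helper name `string_key_to_index_sketch` is hypothetical, and the library's own implementation may differ.

```python
import numpy as np


def string_key_to_index_sketch(string_keys, key_map, unk_index=None):
    """Map a 1D array of strings to integers; unknown keys get unk_index."""
    if unk_index is None:
        unk_index = len(key_map)  # documented default: len(key_map)
    string_keys = np.asarray(string_keys)
    if not key_map:
        return np.full(string_keys.shape, unk_index, dtype=np.int32)

    # Sort the keys once so that every lookup becomes a binary search.
    sorted_keys = np.array(sorted(key_map))
    sorted_vals = np.array([key_map[k] for k in sorted_keys.tolist()], dtype=np.int32)

    # Candidate position of each query key, clipped to a valid index.
    pos = np.clip(np.searchsorted(sorted_keys, string_keys), 0, len(sorted_keys) - 1)
    # A key is known only if the candidate actually matches it.
    found = sorted_keys[pos] == string_keys
    return np.where(found, sorted_vals[pos], unk_index).astype(np.int32)


keys = np.array(["A", "C", "X", "D"])
indices = string_key_to_index_sketch(keys, {"A": 0, "C": 1, "D": 2})
custom_unk = string_key_to_index_sketch(keys, {"A": 0, "C": 1, "D": 2}, unk_index=-1)
```

Here "X" is absent from the mapping, so it receives `len(key_map)` by default, or the explicit `unk_index` when one is passed.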

prxteinmpnn.io.string_to_protein_sequence(sequence, aa_map=None, unk_index=None)

Convert a string sequence to a ProteinSequence.

Parameters:
  • sequence (str) – A string containing the protein sequence.

  • aa_map (dict | None) – A dictionary mapping amino acid names to integer indices. If None, uses the default restype_order mapping.

  • unk_index (int | None) – The index to use for unknown amino acids not found in the mapping. If None, uses unk_restype_index.

Return type:

Int[Array, 'num_residues']

Returns:

A ProteinSequence containing the amino acid type indices corresponding to the input string.
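
A minimal sketch of this encoding, assuming NumPy and a toy alphabetical residue ordering. The names `TOY_AA_MAP`, `TOY_UNK_INDEX`, and `encode_sequence_sketch` are hypothetical, and the library's `restype_order` and `unk_restype_index` may use different values.

```python
import numpy as np

# Toy alphabet: this ordering is an assumption, not the library's restype_order.
TOY_AA_MAP = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
TOY_UNK_INDEX = len(TOY_AA_MAP)  # 20, used for any unmapped letter


def encode_sequence_sketch(sequence, aa_map=TOY_AA_MAP, unk_index=TOY_UNK_INDEX):
    """Return one integer per residue; unmapped letters become unk_index."""
    return np.array([aa_map.get(aa, unk_index) for aa in sequence], dtype=np.int32)


encoded = encode_sequence_sketch("ACDZ")
```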

prxteinmpnn.io.protein_sequence_to_string(sequence, aa_map=None)

Convert a ProteinSequence to a string.

Parameters:
  • sequence (Int[Array, 'num_residues']) – A ProteinSequence containing amino acid type indices.

  • aa_map (dict | None) – A dictionary mapping amino acid type indices to their corresponding names. If None, uses the default restype_order mapping.

Return type:

str

Returns:

A string representation of the protein sequence.
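
The reverse direction can be sketched in the same way. The example below assumes an index-to-letter dictionary built from a toy alphabet and a fallback letter for out-of-range indices; `TOY_ALPHABET`, `decode_sequence_sketch`, and `unk_letter` are hypothetical names, not part of the documented API.

```python
import numpy as np

TOY_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering, for illustration only
TOY_INDEX_TO_AA = dict(enumerate(TOY_ALPHABET))


def decode_sequence_sketch(indices, index_to_aa=TOY_INDEX_TO_AA, unk_letter="X"):
    """Join the letter for each index; indices outside the map become unk_letter."""
    return "".join(index_to_aa.get(int(i), unk_letter) for i in np.asarray(indices))


decoded = decode_sequence_sketch(np.array([0, 1, 2, 20]))
```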

prxteinmpnn.io.residue_names_to_aatype(residue_names, aa_map=None)

Convert 3-letter residue names to amino acid type indices.

Parameters:
  • residue_names (ndarray) – A 1D array of residue names (strings).

  • aa_map (dict | None) – A dictionary mapping residue names to integer indices. If None, uses the default resname_to_idx mapping.

Return type:

Int[Array, 'num_residues']

Returns:

A 1D array of amino acid type indices corresponding to the residue names.
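
One way to perform this conversion is to look up each distinct residue name once and then broadcast the result back to every position. The sketch below assumes NumPy and a four-entry toy mapping; `TOY_RESNAME_TO_IDX` and `TOY_UNK` are hypothetical stand-ins for the library's `resname_to_idx` and its unknown index.

```python
import numpy as np

# Assumed toy mapping; the library's resname_to_idx may use a different order.
TOY_RESNAME_TO_IDX = {"ALA": 0, "CYS": 1, "ASP": 2, "GLY": 3}
TOY_UNK = len(TOY_RESNAME_TO_IDX)  # index assigned to unrecognized names


def residue_names_to_aatype_sketch(residue_names, aa_map=TOY_RESNAME_TO_IDX):
    """Convert 3-letter residue names to integer indices."""
    # np.unique returns the distinct names plus, for each input position,
    # the index of its name within that distinct list.
    names, inverse = np.unique(np.asarray(residue_names), return_inverse=True)
    per_name = np.array([aa_map.get(n, TOY_UNK) for n in names.tolist()], dtype=np.int32)
    return per_name[inverse]


aatype = residue_names_to_aatype_sketch(np.array(["GLY", "ALA", "MSE", "GLY"]))
```

"MSE" is not in the toy mapping, so it falls through to the unknown index.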

prxteinmpnn.io.atom_names_to_index(atom_names, atom_map=None)

Convert atom names to atom type indices.

Parameters:
  • atom_names (ndarray) – A 1D array of atom names (strings).

  • atom_map (dict | None) – A dictionary mapping atom names to integer indices. If None, uses the default atomname_to_idx mapping.

Return type:

Int[Array, 'num_residues']

Returns:

A 1D array of atom type indices corresponding to the atom names.

prxteinmpnn.io._check_atom_array_length(atom_array)

Check if the AtomArray has a valid length.

Parameters:

atom_array (AtomArray) – The AtomArray to check.

Raises:

ValueError – If the AtomArray is empty.

Return type:

None

prxteinmpnn.io._get_chain_index(atom_array)

Get the chain index from the AtomArray.

Return type:

Int[Array, 'num_residues num_atoms']

Parameters:

atom_array (AtomArray)

prxteinmpnn.io._process_chain_id(atom_array, chain_id=None)

Process the chain_id of the AtomArray.

Return type:

tuple[AtomArray, Int[Array, 'num_residues num_atoms']]

Parameters:
  • atom_array (AtomArray)

  • chain_id (optional; defaults to None)

prxteinmpnn.io._fill_in_cb_coordinates(coords_37, residue_names, atom_map=None)

Fill in the CB coordinates for residues that have them.

Parameters:
  • coords_37 (Array) – A 3D array of shape (N, 37, 3) containing the coordinates of the atoms.

  • residue_names (ndarray) – A 1D array of residue names corresponding to the coordinates.

  • atom_map (dict[str, int] | None) – A dictionary mapping atom names to their atom indices. If None, uses the default atom_order mapping.

Return type:

Array

Returns:

A 3D array of shape (N, 37, 3) with the C-beta coordinates filled in for residues that have them.

For glycine residues, the C-beta coordinates are computed from the N, CA, and C atoms.

For other residues, the original C-beta coordinates are retained if they exist.

NOTE: This is not part of the pipeline. Although this step happens in the original code, it is bypassed during feature extraction.
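
The glycine case can be sketched as follows. This is a hedged illustration: it assumes the ideal-geometry coefficients used in the original ProteinMPNN featurization and an AlphaFold-style atom ordering (N, CA, C, CB in slots 0 to 3). Both should be checked against the library's own `atom_order` and source. The names `ideal_cb` and `fill_in_cb_sketch` are hypothetical.

```python
import numpy as np

# Assumed atom37 slots (AlphaFold-style ordering); verify against atom_order.
N_IDX, CA_IDX, C_IDX, CB_IDX = 0, 1, 2, 3


def ideal_cb(n, ca, c):
    """Place a virtual C-beta from backbone atoms using fixed coefficients."""
    b = ca - n
    c_vec = c - ca
    a = np.cross(b, c_vec)  # normal to the N-CA-C plane
    return -0.58273431 * a + 0.56802827 * b - 0.54067466 * c_vec + ca


def fill_in_cb_sketch(coords_37, residue_names):
    """Compute CB for glycine rows; keep the existing CB for all other rows."""
    coords_37 = np.asarray(coords_37, dtype=float)
    is_gly = np.asarray(residue_names) == "GLY"
    cb = ideal_cb(coords_37[:, N_IDX], coords_37[:, CA_IDX], coords_37[:, C_IDX])
    filled = coords_37.copy()  # leave the caller's array untouched
    filled[:, CB_IDX] = np.where(is_gly[:, None], cb, coords_37[:, CB_IDX])
    return filled


coords = np.zeros((2, 37, 3))          # CA sits at the origin for both residues
coords[:, N_IDX] = [1.0, 0.0, 0.0]
coords[:, C_IDX] = [0.0, 1.0, 0.0]
coords[1, CB_IDX] = [9.0, 9.0, 9.0]    # the alanine already has a CB
filled = fill_in_cb_sketch(coords, np.array(["GLY", "ALA"]))
```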

prxteinmpnn.io.process_atom_array(atom_array, atom_map=None, chain_id=None)

Process an AtomArray to create a ProteinStructure.

Return type:

ProteinStructure

Parameters:
  • atom_array (AtomArray)

  • atom_map (optional; defaults to None)

  • chain_id (optional; defaults to None)

prxteinmpnn.io.from_structure_file(file_path, model=1, chain_id=None)

Construct a Protein object from a structure file (PDB, PDBx/mmCIF).

This implementation uses biotite for robust parsing and JAX for efficient vectorized processing to create a dense, fixed-size representation for each residue (37 atoms).

WARNING: All non-standard residue types will be converted into UNK. All atoms not in the canonical 37-atom set will be ignored.

Parameters:
  • file_path (str) – The path to the structure file.

  • model (int) – The model number to load from the structure file. Defaults to 1.

  • chain_id (str | Sequence[str] | None) – If specified, only the given chain or chains are parsed. If None, the entire structure is parsed.

Return type:

ProteinStructure

Returns:

A new ProteinStructure parsed from the file contents.
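
The dense, fixed-size layout mentioned above can be sketched with a scatter into a (num_residues, 37, 3) array. This is an illustration only: it assumes zero-based contiguous residue indices and a five-entry toy atom map. `TOY_ATOM_ORDER` and `pack_atom37_sketch` are hypothetical names, and the real parser works on a biotite AtomArray rather than plain lists.

```python
import numpy as np

NUM_ATOM_SLOTS = 37
# Toy subset of an atom-name map; only these five slots are assumed here.
TOY_ATOM_ORDER = {"N": 0, "CA": 1, "C": 2, "CB": 3, "O": 4}


def pack_atom37_sketch(residue_index, atom_names, atom_coords):
    """Scatter a flat atom list into per-residue slots and a presence mask."""
    residue_index = np.asarray(residue_index)
    slots = np.array([TOY_ATOM_ORDER.get(name, -1) for name in atom_names])
    keep = slots >= 0  # atoms outside the map are ignored

    num_residues = int(residue_index.max()) + 1
    coords_37 = np.zeros((num_residues, NUM_ATOM_SLOTS, 3))
    mask_37 = np.zeros((num_residues, NUM_ATOM_SLOTS), dtype=bool)

    coords_37[residue_index[keep], slots[keep]] = np.asarray(atom_coords)[keep]
    mask_37[residue_index[keep], slots[keep]] = True
    return coords_37, mask_37


flat_coords = np.arange(15, dtype=float).reshape(5, 3)
coords_37, mask_37 = pack_atom37_sketch(
    residue_index=[0, 0, 0, 1, 1],
    atom_names=["N", "CA", "H1", "N", "CA"],  # "H1" is not in the toy map
    atom_coords=flat_coords,
)
```

Slots with no atom stay at zero, and the mask records which slots were actually filled.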

prxteinmpnn.io.from_trajectory(trajectory_file, topology_file=None, chain_id=None)

Construct ProteinStructure objects from a trajectory file.

This function reads a trajectory and yields a ProteinStructure for each frame.

Parameters:
  • trajectory_file (str) – Path to the trajectory file (e.g., DCD, XTC, multi-model PDB).

  • topology_file (str | None) – Path to the topology file (e.g., PDB, PSF), required for coordinate-only trajectory formats.

  • chain_id (str | Sequence[str] | None) – If specified, only atoms from the given chain or chains will be included.

Return type:

Iterator[ProteinStructure]

Returns:

An iterator that yields a ProteinStructure for each frame in the trajectory.
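
The per-frame iteration pattern can be sketched with a plain generator. This toy example assumes the trajectory is already in memory as a (frames, atoms, 3) array and yields tuples rather than ProteinStructure objects; `iter_frames_sketch` is a hypothetical name.

```python
import numpy as np


def iter_frames_sketch(trajectory_coords, atom_names):
    """Yield one (atom_names, coords) pair per frame, reusing the same topology."""
    for frame_coords in np.asarray(trajectory_coords):
        yield atom_names, frame_coords


trajectory = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)  # 2 frames, 3 atoms
frames = iter_frames_sketch(trajectory, ["N", "CA", "C"])

first_names, first_coords = next(frames)   # frames are produced on demand
remaining = sum(1 for _ in frames)         # consumes the rest of the iterator
```

Like any iterator, the result is single-pass: once the frames are consumed, the call has to be repeated to read them again.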

prxteinmpnn.io.from_string(pdb_string, model=1, chain_id=None)

Construct a ProteinStructure from a PDB string.

Parameters:
  • pdb_string (str) – The PDB formatted string.

  • model (int) – The model number to load from the structure string. Defaults to 1.

  • chain_id (str | Sequence[str] | None) – If specified, only the given chain or chains are parsed. If None, the entire structure is parsed.

Return type:

ProteinStructure

Returns:

A new ProteinStructure parsed from the PDB string.

prxteinmpnn.io.protein_structure_to_model_inputs(protein_structure, bias=None)

Convert a ProteinStructure to model inputs.

Parameters:
  • protein_structure (ProteinStructure) – A ProteinStructure object containing the structure data.

  • bias (Float[Array, 'num_residues num_classes'] | None) – An optional InputBias jnp.ndarray with shape (num_residues, 20) containing bias information. This will shift output probabilities for each residue. Defaults to zero.

Return type:

ModelInputs

Returns:

A ModelInputs containing the model inputs derived from the ProteinStructure.
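
The effect of the bias can be illustrated with a small NumPy sketch. It assumes the bias is added to per-residue logits before a softmax, which is one common way to shift output probabilities; how the model actually applies the bias is not stated here.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


num_residues, num_classes = 3, 20
logits = np.zeros((num_residues, num_classes))  # stand-in for model outputs

bias = np.zeros((num_residues, num_classes))    # zero bias changes nothing
bias[0, 5] = 2.0                                # favor class 5 at residue 0 only

unbiased = softmax(logits)
biased = softmax(logits + bias)
```

Only the first residue's distribution changes. The others match the unbiased output, which is what a zero default implies.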