Libra: An open-source "Methodology Discovery" Library


Using Regular Expressions for Reading Molecular Information Files


One of Libra's sub-libraries, libra_py, contains several modules that utilize so-called regular expressions (regex). Regexes make reading files of different formats, and handling the information stored in them, very convenient and transparent. In this tutorial I will explain how regexes are used to read molecular structure, atomic data, and force field parameter files. The reader is referred to 3 modules: LoadMolecule.py, LoadPT.py, and LoadUFF.py, all located here. There are also a few modules similar to LoadUFF.py that load the parameters for other force fields, but they utilize the same constructions used in the above 3 files.

All the noted files/modules have a similar structure and perform a similar action - loading data from specifically formatted files into different data structures. The loading is based on the concept of regular expressions. The regexes are defined in a hierarchical way - starting with elementary patterns and combining them into more complex building blocks that carry a certain semantic value (which may be re-defined depending on the particular circumstances). Below, we discuss these levels of patterns (regular expressions).

The very basic data types, such as int, float, string, etc., can be described using the built-in Python regex syntax, as documented here and here.

import re

#------- Here are some basic patterns -------------
INT    = r'([1-9]([0-9]*))'    # integer with no leading zeros
NINT   = r'([0-9]+)'           # non-negative integer (leading zeros allowed)
SP     = r'\s+'                # one or more whitespace characters
DOUBLE = r'([-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?)'    # floating-point number
WORD   = r'([a-zA-Z]+)'        # a word of Latin letters
ID     = r'(([a-zA-Z]+)([a-zA-Z]+|\d+)*)'                # identifier: letters, then letters or digits
PHRASE = r'"((\w|\W)+)"'       # anything enclosed in double quotes
compINT = re.compile(INT)

For instance, we define the INT regex as a non-zero digit followed by any number of further digits (possibly none). So integer numbers like 11101, 20, or 1010335 will match the pattern, but something like 007 will not.
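As a quick check (a minimal sketch, assuming the pattern definitions above have been executed), we can test a few strings against INT:

print(re.fullmatch(INT, '11101') is not None)   # True
print(re.fullmatch(INT, '20') is not None)      # True
print(re.fullmatch(INT, '007') is not None)     # False - a leading zero is not allowed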

At the next level of complexity, we define named patterns:

#------- Here we define a format of file ----------
# p - means 'Pattern'
pAtom_keyword = '(?P<Atom_keyword>'+'HETATM'+')'+SP
pAtom_id      = '(?P<Atom_id>'+DOUBLE+')'+SP
pAtom_element = '(?P<Atom_element>'+WORD+')'+SP
pAtom_chain   = '(?P<Atom_chain>'+WORD+')'+SP
pAtom_id1     = '(?P<Atom_id1>'+DOUBLE+')'+SP
pAtom_x_coord = '(?P<Atom_x_coord>'+DOUBLE+')'+SP
pAtom_y_coord = '(?P<Atom_y_coord>'+DOUBLE+')'+SP
pAtom_z_coord = '(?P<Atom_z_coord>'+DOUBLE+')'+SP
pAtom_type    = '(?P<Atom_type>'+INT+')'+SP
pAtom_charge  = '(?P<Atom_charge>'+DOUBLE+')'+SP

Note that we simply wrap the basic patterns (using their names) in an extended construction in which we also define an identifier for the given pattern. The identifier is given in the angle brackets and will later be used to access the value matched by the specific pattern of the given type. You may have noticed that the same basic pattern (e.g. DOUBLE or WORD in our example) may have different identifiers associated with it. The identifiers hence serve the purpose of differentiating between patterns of the same type. For instance, let's say your analysis suggests that a string contains 3 patterns of the DOUBLE type. How are you going to say which DOUBLE carries which meaning? The identifiers do just that. In other words, the identifiers add semantics (meaning) to the basic, general patterns. In this way, we define meaningful (semantics-carrying) patterns to be used in our analysis: pAtom_keyword, pAtom_id, pAtom_x_coord, pAtom_y_coord, etc.
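To illustrate how such a named pattern behaves (a minimal sketch; the test string is made up for illustration), a match object lets us retrieve the matched fragment by its identifier:

m = re.search(pAtom_x_coord, ' -1.305 ')
if m != None:
    print(m.group('Atom_x_coord'))   # prints: -1.305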

But how do we know which of the found patterns should be assigned to which identifier? The most straightforward way is, of course, by the position of each pattern in some sequence of patterns. This is, basically, a format. To specify the format, we define the whole sequence of expected patterns in a given order. For instance, the lines:

if format=="pdb":
    Atom_Record = pAtom_keyword + pAtom_id + pAtom_element + pAtom_id1 + pAtom_x_coord + pAtom_y_coord + pAtom_z_coord + pAtom_charge
elif format=="pdb_1":
    Atom_Record = pAtom_keyword + pAtom_id + pAtom_element + pAtom_id1 + pAtom_x_coord + pAtom_y_coord + pAtom_z_coord + pAtom_chain
elif format=="true_pdb":
    # pAtom_mol and pAtom_occ are defined analogously to the patterns shown above (see LoadMolecule.py)
    Atom_Record = pAtom_keyword + pAtom_id + pAtom_element + pAtom_mol + pAtom_chain + pAtom_id1 + pAtom_x_coord + pAtom_y_coord + pAtom_z_coord + pAtom_occ + pAtom_charge

define 3 formats for the Atom_Record lines in files containing some structural information about the system. Note that these format lines are the minimal sets needed for the program to find such records. If an input line contains anything beyond the pattern, the record will still be discovered. For instance, in the first format, "pdb", we can put any comments or additional information about the atom right after the final pAtom_charge entry, and the line will still be valid input.
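To see this in action (a minimal sketch, assuming format was set to "pdb" above; the sample line is made up for illustration rather than taken from a real file), we can match Atom_Record against a line that carries a trailing comment:

line = "HETATM 1 C 1 -1.305 0.233 0.000 -0.3  extra notes about this atom\n"
m = re.search(Atom_Record, line)
print(m != None)                  # True - the trailing text does not prevent the match
print(m.group('Atom_element'))    # C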

In the LoadMolecule.py module, in addition to the above-mentioned Atom_Record format line, we also define the Bond_Record and Fragment_Record formats to search for different elements of the input (connections between atoms, and grouping of the atoms into rigid bodies, respectively).
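These format lines are assembled in the same way as Atom_Record. As a rough illustration only (the keyword and identifier names below are hypothetical; the actual definitions in LoadMolecule.py may differ), a fragment record could start with a keyword pattern followed by the fragment id, with the list of atom ids on the rest of the line collected afterwards (see the findall discussion below):

pFrag_keyword = '(?P<Frag_keyword>'+'GROUP'+')'+SP   # hypothetical keyword, for illustration
pFrag_id      = '(?P<Frag_id>'+INT+')'+SP            # id of the fragment
Fragment_Record = pFrag_keyword + pFrag_id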

Once all the format lines are defined, we are ready to analyze our text and try to find matches that correspond to any of the defined format lines (extended patterns). First, we read all the lines of the file into a list of strings, A, and then we look for a particular pattern (in this example, Atom_Record) in each of these lines:

A = f.readlines()    # f is the file object opened earlier
#---------- Create atoms ----------------
for a in A:
    m1 = re.search(Atom_Record,a)
    if m1!=None:
        ...

The result of the pattern search is the match object m1. It is equal to None if the Atom_Record pattern was not found in the line a. Otherwise, the match object contains all the information we need.

To access the information we need in a controlled way, we use the elementary pattern identifiers together with the "start()" and "end()" functions of the match object:

atom_dict = {}
atom_dict["Atom_element"] = a[m1.start('Atom_element'):m1.end('Atom_element')]
# Angst_to_Bohr is the Angstrom-to-Bohr unit conversion factor defined elsewhere in the module
atom_dict["Atom_cm_x"] = float(a[m1.start('Atom_x_coord'):m1.end('Atom_x_coord')]) * Angst_to_Bohr
atom_dict["Atom_cm_y"] = float(a[m1.start('Atom_y_coord'):m1.end('Atom_y_coord')]) * Angst_to_Bohr
atom_dict["Atom_cm_z"] = float(a[m1.start('Atom_z_coord'):m1.end('Atom_z_coord')]) * Angst_to_Bohr

This way we create a Python dictionary with specially designed keywords. The keywords Atom_element, Atom_cm_x, etc. will be important at the next stage, namely in the constructor of the Atom object of the "libchemobjects" library.
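As a side note (a minimal sketch; this is simply the standard alternative offered by Python's re module, equivalent to the slicing above), the same fragments can be retrieved directly with the "group()" function of the match object, using the same identifiers:

atom_dict["Atom_element"] = m1.group('Atom_element')
atom_dict["Atom_cm_x"] = float(m1.group('Atom_x_coord')) * Angst_to_Bohr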

In passing, we should emphasize the method for reading lists of integers (which can be adapted to other data types). We need this for Bond_Record and Fragment_Record. For instance, a bond record contains a list of integers such that the first one is the identifier of a given atom and the rest are the ids of all atoms covalently connected to it. In a Fragment_Record we simply read a list of ids of the atoms that are grouped together to form a fragment.

For the case of Fragment_Record, the corresponding code to handle such lists will look like:

lst = compINT.findall(m3.group())   # m3 is the match object for a Fragment_Record line
gr_atoms = []
i = 0
while i < len(lst):
    gr_atoms.append(int(float(lst[i][0])))   # lst[i] is a tuple of groups; element 0 is the whole integer
    i = i + 1

Here, we use the "findall()" function of the compiled regular expression object compINT. We also use the "group()" function of the match object to supply the string on which "findall" operates (other choices would produce a different level of patterning). The output is a list of entries; because INT contains two groups, each entry is a tuple whose first element is the whole integer we wanted to read. This is then used to reduce the list "lst" to an ordinary list of integers, "gr_atoms".
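As a quick illustration (a minimal sketch with a made-up fragment line), findall returns one tuple per integer found, and taking element 0 of each tuple yields the plain numbers:

lst = compINT.findall("GROUP 1 2 13 25")
print(lst)         # [('1', ''), ('2', ''), ('13', '3'), ('25', '5')]
gr_atoms = [int(x[0]) for x in lst]
print(gr_atoms)    # [1, 2, 13, 25]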



The main difference between the 3 modules listed above lies in the types of Libra objects they use. This also affects the specific way they are interfaced with a Python script. Below, we briefly describe the main mechanisms and object types.