Libra

Using Regual Expressions for Reading Molecular Information Files

One of the Libra's sub-library libra_py contains several modules that utilize the so-called regular expressions (regex). The regex use makes the reading files of different format and handling the information stored in them very conenient and transparent. In this tutorial I will explain how regex are used to read molecular structure, atomic data, and force field parameter files. The reader is now referred to 3 modules: LoadMolecule.py, LoadPT.py, and LoadUFF.py, all located here. There are also a few modules similar to LoadUFF.py to load the parameters for other force fields, but they utilize the same constructions used in the above 3 files.

All the noted files/module have a similar structure and are developed to perform a similar action - to load some data from a specifically-formattet files to different data structures. The loading is based on the concept of regular expressions. The regexes are defined in a hierarchical way - starting with the elementary patterns and combining them into more complex building blocks that contain a certain semantic value (which may be re-defined depending on the particular circumstances). Below, we will discuss these levels of the patterns (regular expressions).

The very basic data types, such as int, float, string, etc. can be defined using the built-in Python regex syntax as defined here and here.

#------- Here are some basic patterns -------------
INT    = '([1-9]([0-9]*))'
NINT   = '([0-9]+)'
SP     = '\s+'    
DOUBLE = '([-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?)'
WORD   = '([a-zA-Z]+)'
ID     = '(([a-zA-Z]+)([a-zA-Z]+|\d+)*)'
PHRASE = '"((\w|\W)+)"'
compINT = re.compile(INT)

For instance, we define an INT regex as a sequence of digits starting from non-zero digit followed by any number of any digits (including zero). So the integer numbers like 11101, 20, or 1010335 will match the pattern, but not anything like 007.

At the next level or complexity, we define named patterns:

#------- Here we define a format of file ----------
# p - means 'Pattern'
    pAtom_keyword = '(?P<Atom_keyword>'+'HETATM'+')'+SP
    pAtom_id      = '(?P<Atom_id>'+DOUBLE+')'+SP    
    pAtom_element = '(?P<Atom_element>'+WORD+')'+SP
    pAtom_chain   = '(?P<Atom_chain>'+WORD+')'+SP
    pAtom_id1     = '(?P<Atom_id1>'+DOUBLE+')'+SP
    pAtom_x_coord = '(?P<Atom_x_coord>'+DOUBLE+')'+SP
    pAtom_y_coord = '(?P<Atom_y_coord>'+DOUBLE+')'+SP
    pAtom_z_coord = '(?P<Atom_z_coord>'+DOUBLE+')'+SP
    pAtom_type    = '(?P<Atom_type>'+INT+')'+SP
    pAtom_charge  = '(?P<Atom_charge>'+DOUBLE+')'+SP

Note, that we simply wrap the basic patterns (using their names) in the extended construction, in which we also define the identifier of the given pattern. The identifier is given in the angle brackets and will later be used to access the value of the specific pattern of given type. You may have noticed that the same basic pattern (e.g. DOUBLE or WORD in our example) may have different identifyers associated with it. The identifyers hence serve the purpose of differentiating between patterns of the same type. For instance, lets say, the analysis suggested that you have 3 patterns of the DOUBLE type in your string. So, how are you going to say which DOUBLE contains which meaning? The identifiers do just that. In other words, the identifiers add a semantics (meaning) to the basic general patterns. In this way, we define meaningfull (carrying some semantics) patterns to be used in our analysis: pAtom_keyword, pAtom_id, pAtom_x_coord, pAtom_y_coord, etc.

But how do we know which of the found patterns should be assigned to which pattern identifier? Well, the most straightforward way is, of course, by the position of the patterns in some sequence of patterns. This is, basically, a format. To specify the format, we define the whole sequence of expected patterns in given order. For instance, the lines:

if format=="pdb":
    Atom_Record = pAtom_keyword + pAtom_id + pAtom_element + pAtom_id1 + pAtom_x_coord + pAtom_y_coord + pAtom_z_coord + pAtom_charge 
elif format=="pdb_1":
    Atom_Record = pAtom_keyword + pAtom_id + pAtom_element + pAtom_id1 + pAtom_x_coord + pAtom_y_coord + pAtom_z_coord + pAtom_chain
elif format=="true_pdb":
    Atom_Record = pAtom_keyword + pAtom_id + pAtom_element + pAtom_mol + pAtom_chain + pAtom_id1 + pAtom_x_coord + pAtom_y_coord + pAtom_z_coord + pAtom_occ + pAtom_charge

define 3 formats for the Atom_Record lines in a files containing some structural information about the system. Note that these format lines are the minimal sets needed for the program to find such records. If the input line contains anything beyond the pattern, it will still be discovered. For instance, in the first format "pdb", we can put any comments or any additional information about the atom, right after final pAtom_charge entry. This will still be a valid input.

In the LoadMolecule.py module, in addition the above mentioned Atom_Record format line we also define the Bond_Record and Fragment_Record formats to search for different elements of input (connections between atoms, and grouping of the atoms in the rigid bodies, respectively).

Once all the format lines are defined, we are ready to analyze our test and try to find any matches that correspond to any of those defined format lines (extended patterns). First, we read in all the lines of the file into an array of strings, A and then we look for a particular pattern (in this example, Atom_Record) in each of these lines:

A = f.readlines()    
#---------- Create atoms ----------------
for a in A:        
    m1 = re.search(Atom_Record,a)
    if m1!=None:           
        ...

The result of the pattern search is a match object m1. It is equal to "None", if no pattern Atom_Record was found in the line a. Otherwise, the match object will contain all the information we need.

To access the information we need in a controlled way, we use the elementary pattern identifiers, together with "start()" and "end()" functions of the match object type:

atom_dict = {}
atom_dict["Atom_element"] = a[m1.start('Atom_element'):m1.end('Atom_element')]
atom_dict["Atom_cm_x"] = float(a[m1.start('Atom_x_coord'):m1.end('Atom_x_coord')]) * Angst_to_Bohr
atom_dict["Atom_cm_y"] = float(a[m1.start('Atom_y_coord'):m1.end('Atom_y_coord')]) * Angst_to_Bohr
atom_dict["Atom_cm_z"] = float(a[m1.start('Atom_z_coord'):m1.end('Atom_z_coord')]) * Angst_to_Bohr

This way we create a Python dictionary with specially designed keywords. The keywords: Atom_element, Atom_cm_x, etc. will be importnat in the next stage, namely, in the constructor of the Atom object of the "libchemobjects" library.

In passing, we should emphasize the method for inputting lists of integers (can be adapted to different data types). We need this in Bond_Record and in Fragment_Record. For instance, the bond record contains the list of integers such that the first one is the identifier of a given atom, and the rest are the ids of all atoms that are connected covalently to the given one. In Fragment_Record we simply input a list of the ids for the atoms that are grouped together to form a fragment.

For the case of Fragment_Record, the corresponding code to handle such lists will look like:

lst = compINT.findall(m3.group())
gr_atoms = []
i = 0
while i < len(lst):
    gr_atoms.append(int(float(lst[i][0])))
    i = i + 1

Here, we use "findall()" function of the regular expression object compINT. We also use the "group()" function of the match object to customize the "findall" function (other options would produce different level of patterning). The output is the list of entries. The first element of each entry is the integer we wanted to read. This is then used to simplify the list "lst" to a normal list of integers "gr_atoms".

The main difference between the 3 modules listed above comes in the types of Libra objects they are using. This also affect the specific way they are interfaced with Python script. Below we briefly describe the main mechanisms and object types.

LoadMolecule.py The Python objects we construct in the above pattern analysis are used in the functions of the "libchemobjects" lirary:
syst.CREATE_ATOM( Atom(univ,atom_dict) ) syst.LINK_ATOMS(int(float(lst[0][0])),int(float(lst[i][0]))) syst.GROUP_ATOMS(gr_atoms,j)
The first function, CREATE_ATOM simply creates an Atom object, using the dictionary "atom_dict". The keys of the dictionary should be properly chosen so that the extraction of their values to the internal C++ variables would be meaningfull. See the constructor of objects of "Atom" type here for the full list of supported keys. The created atom becomes an element of Atoms array of the syst object.

The second function, links to atoms that belong to syst object (already created within it!). The linking actually updates the entire topology of the system, creating all bonds, angles, dihedrals, and improper dihedrals records, all stored inside the syst object. As we can see, the LINK_ATOM funciton takes only 2 parameters - the ids of the two atoms we want to link (note, ids start with 1: 1,2,3, ..., not with 0!).

Finally, the GROUP_ATOMS function groups the atoms that are already in syst object to create an element of Fragments array of the syst object. The arguments to the function are simply the Python list of integers, containing the ids (not indices!) of the atoms to group, and the integer that will become the Fragment id.

All 3 functions are defined in System_methods2.cpp file.
LoadPT.py This file implements a procedure to load data into the Periodic Table object defined within the Universe object (passed by reference as a variable U). The "Universe" class is designed to contain all general-purpose information about chemical species and physical parameters. It is used as a general database that contains such information. This database can be modified at every stage. It also used when creating different objects. So far, the main usage of the "Universe" class has been the storage of "PeriodicTable" variable. For details, see here. The "PeriodicTable" is essentially a map between the symbolic element name and the structure (of "Element" type) containing essential information about the element. Although, the database may be hard-coded, we prefer its loading from the corresponding data file. (Copy this file from the indicated location to the location of your tutorial/script, and modify as needed). This way, user may easily tune the properties of elements, to set them to even non-physical values (e.g. set some effective masses). As such, we need a way to load those data into the instances of the "Universe" class. This is exactly what the "LoadPT.py" script does.

Although the pattern recognition part is exactly the same as in LoadMolecule.py, the LoadPT.py script uses a different way to interface Python and C++ objects. Namely, we first create an object of any arbitrary Python class:
# First, declare the class (class name does not matter) class element: pass # Now, create an object elem = element()
As you can see, the class is empty - no data-members have been declared so far. However, one of the very interesting features of Python is the ability to define data members at later time - when they are first initialized with some value. In our example, this is done in the following lines:
elem.Elt_name = a[m1.start('Element_name'):m1.end('Element_name')] elem.Elt_id = int(float(a[m1.start('Element_atomic_number'):m1.end('Element_atomic_number')])) elem.Elt_number = elem.Elt_id elem.Elt_nucleus_charge = elem.Elt_id elem.Elt_mass = float(a[m1.start('Element_atomic_mass'):m1.end('Element_atomic_mass')]) * Mass_conv
We now create an instance of the "Element" class. This is the Libra class, which is exposed to Python. This is done in:
elt_record = Element()
Once the member-variables of the auxiliary Python class are defined and initialized, the Python object can be used as an input argument of the method of the "Element" class:
elt_record.set(elem)
This instruction will extract the existing variables of the elem object and assign corresponding elements of the elt_record object. In some sense, the procedure is similar to extracting values of the map by the names of the corresponding keys. The difference is that we know use another feature of Python (also available at the C++ level with Boost.Python) - the verification of the presece of some specifically-named attributes (variables) of the class. Thus, the names of fields of the auxiliary object elem should be properly selected. For the full list of available attribute names, see the "set()" function in Element.cpp.

Finally, once the properties of the instance elt_record of the class "Element" are setup, we can add this record to the PeriodicTable variable of the "Universe" class U.Add_Element_To_Periodic_Table(elt_record). The record object can also be printed via elt_record.show_info(). And, as you can see, the elements of the elt_record object can be accessed directly, e.g. print "load element", elem.Elt_name. To see the full list of exported data members and methods of the "Element" class, refer to the export file: libuniverse.cpp .

The convention of Libra is put all exported data and function members of exported classes into the ".cpp" file with the "lib" prefix.

LoadUFF.py This script implements methods to load the parameters of the UFF force field. Since UFF is rather generic, one only needs the atomic data. The data are stored in the uff.dat file. Hence, we only look for the Atom_Record patterns, as well as for the force-field-wide settings. The outcome of the function is the object of the "ForceField" class - force_field. The object contains the lists (arrays) of records of different types. In this case we only set up the records of the "Atom_Record" type. The Python/C++ interface for the data of these variables is realized via an auxiliary class "atomrecord", like in the LoadPT example above:

class atomrecord:
    pass
	
ff = atomrecord()
ff.Atom_ff_int_type = int(float(a[m1.start('Atom_ff_int_type'):m1.end('Atom_ff_int_type')]))
ff.Atom_radius      = float(a[m1.start('Atom_radius'):m1.end('Atom_radius')])
ff.Atom_theta       = float(a[m1.start('Atom_theta'):m1.end('Atom_theta')])
...

for a in A:
    m1 = re.search(UFF_Atom_Type_Record,a)

    atom_record = Atom_Record()
    atom_record.set(ff)
    res = force_field.Add_Atom_Record(atom_record)

The force-field-wide parameters are set up via the dictionary, like we did in LoadMolecule example above:

ff_par = {}
for a in A:
	m2 = re.search(FF_sigma_rule_record,a)
    m3 = re.search(FF_epsil_rule_record,a)
    ...
		
    if m2!=None:
        ff_par["sigma_comb_rule"] = a[m2.start('FF_sigma_rule_value'):m2.end('FF_sigma_rule_value')]
    if m3!=None:
        ff_par["epsilon_comb_rule"] = a[m3.start('FF_epsil_rule_value'):m3.end('FF_epsil_rule_value')]
	...
			
force_field.set(ff_par)

Libra: An open-source "Methodology Discovery" Library

Using Regual Expressions for Reading Molecular Information Files