4 Sequence Tools

Ordalie offers two main types of tools: tools that query or add information (Search, Identity, Tree, Conservation, ...) and tools that change the current snapshot (Editor, Cluster, Feature, etc.). Keyboard shortcuts give direct access to the tools:

Key                  Tool
<Shift + A>, <A>     Annotation
<Shift + C>, <C>     Conservation
<Shift + E>, <E>     Editor
<Shift + F>, <F>     Feature tool
<Shift + G>, <G>     Clustering
<Shift + I>, <I>     Identity
<Shift + M>, <M>     Search motif
<Shift + S>, <S>     Superpose
<Shift + T>, <T>     Tree building

4.1 The Identity tool

This tool is used to query information on identity percentages between sequences.

4.1.1 Control panel

Figure 3: The control panel of the Identity tool
Selection

The identity percentage can be computed for a subset of the sequences and over a user-defined residue range. The left part of the Control panel handles the selection of the sequences and of the residue range.

Computation and Results

The 'Compute' button calculates the identity percentage between the selected sequences over the selected residue range. A summary of the computation is logged. The two comboboxes that follow select the pair of sequences for which the identity percentage is wanted. The identity percentage and the lengths of the two ungapped sequences are then displayed.
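As an illustration of the quantities reported here, the following Python sketch computes an identity percentage and ungapped lengths for two aligned sequences. It is not Ordalie's code: the function names, the gap characters and, above all, the choice of denominator (columns where both sequences carry a residue) are assumptions, since the manual does not state how the percentage is normalised.

```python
def percent_identity(seq_a, seq_b, start=0, end=None, gap_chars="-."):
    """Percent identity of two aligned sequences over columns [start, end)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    end = len(seq_a) if end is None else end
    identical = 0
    compared = 0
    for a, b in zip(seq_a[start:end], seq_b[start:end]):
        # Assumption: a column with a gap in either sequence is ignored.
        if a in gap_chars or b in gap_chars:
            continue
        compared += 1
        if a.upper() == b.upper():
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def ungapped_length(seq, gap_chars="-."):
    """Number of residues in an aligned sequence once gaps are removed."""
    return sum(1 for c in seq if c not in gap_chars)


print(percent_identity("GAS-TE", "GATATE"))  # 4 identities over 5 compared columns
print(ungapped_length("GAS-TE"))
```

Other conventions (dividing by the length of the shorter sequence, or by the full range width) give different values for gapped alignments, so the figures from this sketch need not match the ones shown by the tool.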

The 'Summary' button opens a window that gives, for the whole sequence and for each group:

The 'Return' button will leave the Identity tool.

4.2 The Search motif tool

This tool allows the user to search for a particular sequence motif inside the alignment.

4.2.1 The pattern syntax

The syntax of the search pattern follows the rules of the FindPatterns program of the GCG Wisconsin Package [15]. The following subsections are adapted from the FindPatterns documentation.

Basic syntactic rules

The search pattern can include any legal sequence character, and can also include several non-sequence characters, which are used to specify 'OR' matching, 'NOT' matching, 'begin' and 'end' constraints, and repeat counts. For instance, the pattern GASTE(X){20,30}FTG searches for GASTE, followed by 20 to 30 of any amino acid, followed by FTG. The syntax for pattern specification is explained below.

Implied Sets and Repeat Counts

Parentheses () enclose one or more symbols that can be repeated a certain number of times. Braces {} enclose numbers indicating how many times the symbols within the preceding parentheses must be found.

Sometimes it is possible to leave out part of an expression. If braces appear without preceding parentheses, the numbers in the braces define the number of repeats for the immediately preceding symbol. One or both of the numbers within the braces may be missing. For instance, both the pattern GASG{2,}F and the pattern GASG{2}F mean GAS, followed by G repeated from 2 to 350,000 times, followed by F. The pattern GASG{}F means GAS, followed by G repeated from 0 to 350,000 times, followed by F. The pattern GAS(TE){,2}F means GAS, followed by TE repeated from 0 to 2 times, followed by F. The pattern GAS(TE){2,2}F means GAS, followed by TE repeated exactly 2 times, followed by F. If the pattern in the parentheses is an OR expression (see below), it cannot be repeated more than 2,000 times.

OR Matching

Several symbol choices are specified by enclosing the choices in parentheses and separating them with commas. For instance, RGF(Q,A)S means RGF followed by either Q or A followed by S. The choices need not have the same length, and there can be up to 31 different choices within each set of parentheses. The pattern GAT(TG,T,G){1,4}A means GAT followed by any combination of TG, T, or G from 1 to 4 times followed by A. The sequence GATTGGA matches this pattern. There can be several sets of parentheses in a pattern, but parentheses cannot be nested.

NOT Matching

The pattern GC~CAT means GC, followed by any symbol except C, followed by AT. The pattern GC~(A,T)CC means GC, followed by any symbol except A or T, followed by CC.

Begin and End Constraints

The pattern <GACCAT can only be found if it occurs at the beginning of the sequence range being searched. Likewise, the pattern GACCAT> would only be found if it occurs at the end of the sequence range.
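To make the rules of this section concrete, the following Python sketch translates a pattern written in the syntax above into a regular expression for Python's standard `re` module. It is an illustration only, not the code used by Ordalie or FindPatterns. The function names are hypothetical, and the sketch makes several assumptions: X stands for any symbol (as in the GASTE(X){20,30}FTG example), the 350,000 and 2,000 repeat limits are not enforced, NOT matching is limited to single symbols, and malformed patterns are not validated.

```python
import re


def _symbols(text):
    # Assumption: X stands for "any symbol"; other characters match themselves.
    return "".join("." if ch == "X" else re.escape(ch) for ch in text)


def findpatterns_to_regex(pattern):
    """Translate a FindPatterns-style pattern into a Python regular expression."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == "<" and i == 0:            # begin constraint
            out.append("^")
            i += 1
        elif c == ">" and i == n - 1:      # end constraint
            out.append("$")
            i += 1
        elif c == "(":                     # group, possibly an OR expression
            close = pattern.index(")", i)
            choices = pattern[i + 1:close].split(",")
            out.append("(?:" + "|".join(_symbols(ch) for ch in choices) + ")")
            i = close + 1
        elif c == "~":                     # NOT matching
            if i + 1 < n and pattern[i + 1] == "(":
                close = pattern.index(")", i)
                choices = pattern[i + 2:close].split(",")
                i = close + 1
            else:
                choices = [pattern[i + 1]]
                i += 2
            if any(len(ch) != 1 for ch in choices):
                raise ValueError("this sketch only negates single symbols")
            out.append("[^" + "".join(re.escape(ch) for ch in choices) + "]")
        elif c == "{":                     # repeat count
            close = pattern.index("}", i)
            low, _, high = pattern[i + 1:close].partition(",")
            # {2} and {2,} both mean "2 or more"; {} means "0 or more".
            out.append("{%s,%s}" % (low or "0", high))
            i = close + 1
        else:
            out.append(_symbols(c))
            i += 1
    return "".join(out)


regex = findpatterns_to_regex("GAT(TG,T,G){1,4}A")
print(regex)
print(bool(re.search(regex, "GATTGGA")))
```

The begin and end constraints map to `^` and `$`, so the regular expression should be applied to the sequence range being searched rather than to the full sequence.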

4.2.2 Control panel

Figure 4: The Control panel of the Search motif tool

The Control panel is limited to the motif entry box in which the pattern should be entered, the 'Search' button to launch the search, the 'Find Next' button to go to the next occurrence of the motif, and the 'Return' button to leave the search tool.

When a motif is found, the background of the snapshot window will become black, and the motifs will be highlighted in red.

4.3 Sequence information browser and editor

All information attached to a protein that is not a feature can be viewed and/or edited in Ordalie. Depending on the origin of the alignment (fasta/msf/clustal or Macsim/ORD files) some fields may be empty.

When browsing or editing sequence information, a selector will appear at the top of the window to select the protein of interest. If this protein presents some unusual characteristics (unknown amino acids, the sequence corresponds to a fragment, ...) a red warning will appear on the left of the window.

4.3.1 Browsing information

The information is arranged in four frames.

4.3.2 Editing information

Some fields can be edited, either to set or to correct their value. The editable fields are: 'sequence name', 'accession number', 'Bank Id', 'description', 'Organism', 'Taxa Id', 'life Domain' and 'E.C.'.

Attention: Ordalie organizes the protein information using the protein sequence name as a unique reference. Extra care should be taken when changing the sequence name of a protein.

The changes are applied as soon as the 'OK' button is pressed.

4.4 VRP

The VRP (Vectorial Representation of Protein) tool can be used to define protein characteristics graphically. The protein sequence is represented as the path formed by its successive amino acids, each taken as a vector. The vector associated with each amino acid is given by a multidimensional scaling of the PAM250 similarity matrix [2].
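The idea of a path built from successive vectors can be sketched in a few lines of Python. The coordinates below are made up for illustration and are not the values derived from PAM250 by Ordalie. The two-dimensional layout, the start at the origin and the names `AA_VECTORS` and `vrp_path` are likewise assumptions.

```python
# Hypothetical 2D vectors, for illustration only (not Ordalie's values).
AA_VECTORS = {
    "A": (1.0, 0.0),
    "G": (0.0, 1.0),
    "S": (-1.0, 0.5),
}


def vrp_path(sequence, vectors=AA_VECTORS):
    """Return the points visited when chaining residue vectors from the origin."""
    points = [(0.0, 0.0)]
    for residue in sequence:
        dx, dy = vectors[residue]
        x, y = points[-1]
        # Each residue moves the path by its own vector.
        points.append((x + dx, y + dy))
    return points


print(vrp_path("GAS"))
```

Each point after the origin corresponds to one residue, which is consistent with the one-dot-per-amino-acid display described in the next section.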

4.4.1 The Drawing Area

Figure 5: The VRP window

When opened, the top part of the window displays the VRP of the first protein in the snapshot. Each dot corresponds to an amino acid, and clicking on a dot with <Button-1> displays its name and position in the sequence. The VRP can be moved around by dragging the mouse with <Button-1> down. Dragging the mouse with <Button-3> down zooms the drawing in and out.

4.4.2 The Control panel

On top of the Control panel is the sequence of the currently selected protein. If a dot has been picked in the drawing area, its corresponding residue will be displayed with a red background in the sequence window. By clicking on a residue in this sequence window, its corresponding dot will be displayed and labeled in red.

Sequence selection is done through the 'Sequence' combobox. At the top of the combobox there are items named 'All' and, if applicable, 'GroupX', where X is an integer indicating the group number. These items display the VRP of the whole snapshot or of a group, if groups are present. The group VRP is built by drawing, for each column of the snapshot, the average vector of the column scaled by the number of residues inside the column.
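The per-column computation of a group VRP can be sketched as follows. This is an interpretation, not Ordalie's implementation: the text does not say how the scaling by the number of residues is applied, so the sketch assumes the average vector is multiplied by the fraction of sequences that have a residue in the column. The function name, the gap characters and the 2D vectors are also assumptions.

```python
def group_vrp_steps(aligned_seqs, vectors, gap_chars="-."):
    """Return one step vector per alignment column for a group of sequences."""
    steps = []
    for column in zip(*aligned_seqs):
        residues = [r for r in column if r not in gap_chars]
        if not residues:
            steps.append((0.0, 0.0))  # column made of gaps only: no movement
            continue
        mean_x = sum(vectors[r][0] for r in residues) / len(residues)
        mean_y = sum(vectors[r][1] for r in residues) / len(residues)
        # Assumption: scale by the fraction of sequences with a residue here.
        weight = len(residues) / len(column)
        steps.append((mean_x * weight, mean_y * weight))
    return steps


# Hypothetical vectors, for illustration only.
vectors = {"A": (1.0, 0.0), "G": (0.0, 1.0)}
print(group_vrp_steps(["AG", "A-"], vectors))
```

Chaining these steps from the origin, as for a single sequence, gives the path drawn for the group.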

When the 'Overdraw' checkbox is checked, the display is not cleared between VRP renderings, which allows several VRPs to be displayed at the same time.

The 'Feature' combobox selects a feature to be mapped onto the VRP drawing. No feature is mapped when dealing with a group. The 'Circle' button displays the amino acid vectors used to build the VRP, the 'Print' button creates a PNG image of the current VRP drawing, and the 'Close' button closes the window.

4.5 Feature Editor

Features in Ordalie may come from the original alignment file (Macsims/XML or ORD files), be created within Ordalie (computing residue conservation, for example, creates a new feature), be loaded from a file in the feature file format (see section 7.4), or be defined by the user. This tool is dedicated to feature management.

4.5.1 Control panel

The Control panel of the 'Features Editor' is really simple.

Figure 6: The Control panel of the Feature Editor Mode

It consists, from left to right, of:

4.5.2 Notation and Actions

It is important to understand the difference between a Feature and an Item of a Feature. Here, a Feature represents a set of instances of a given sequence characteristic that may be distributed over the whole snapshot. A Feature Item, or Item for short, is one instance of a Feature for a given sequence at a given place in the snapshot.
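The distinction can be pictured with a small data model. The Python sketch below is hypothetical: the class and field names are not taken from Ordalie, and the Color, Score and Note fields simply mirror the item properties mentioned later in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """One instance of a feature: a residue range on one sequence."""
    sequence: str            # name of the sequence carrying the item
    start: int               # first position of the range in the snapshot
    end: int                 # last position of the range in the snapshot
    color: str
    score: Optional[float] = None
    note: str = ""


@dataclass
class Feature:
    """A named set of items that may be spread over the whole snapshot."""
    name: str
    items: List[Item] = field(default_factory=list)

    def sequences(self):
        """Names of the sequences that carry at least one item."""
        return sorted({item.sequence for item in self.items})


helix = Feature("Helix")
helix.items.append(Item("seq1", 10, 25, "red"))
helix.items.append(Item("seq2", 12, 27, "red", note="predicted"))
print(helix.sequences())
```

In this picture a Feature is the whole `helix` object, while each `Item` is a single coloured range on a single sequence.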

4.5.3 Contextual Menu

Unlike in the other tools, it is possible to interact directly with the features inside the snapshot window. A right click opens a contextual menu that offers several actions.

Attention: in the 'Feature Editor' tool, the action of <Button-1> depends on the key combination:
<Button-1> alone: the action applies to the sequence under the mouse pointer.
<Control + B1>: the action applies to the group to which the sequence under the mouse pointer belongs.
<Shift + B1>: the action applies to all the sequences present in the snapshot.

Select Item

Selects the Item just under the mouse pointer. If only <Button-1> is pressed, the Item of the sequence is selected. If <Control + B1> is pressed, all the Items at that position for the sequences of the group are selected. If <Shift + B1> is pressed, all the Items at that position for all sequences in the snapshot are selected.

Select All Items

Selects all Items of a sequence, of a group of sequences or of the whole snapshot, depending on the key pressed.

Note: Selecting all Items for all sequences means that the whole Feature is selected. If it is subsequently deleted, then the whole Feature will be deleted.

Select Region

A region, i.e. a residue range, can be selected by pressing <Button-1> and dragging the mouse along the sequence axis. Depending on whether no key, the <Control> key or the <Shift> key is held down meanwhile, the selected region covers the current sequence, the group of sequences or all sequences, respectively. The selected region can then be used to define a new Item.

Clear Selection

Clears all selections currently set.

Once one or more Items or a region have been selected, several options are available.

Edit Item

If the selection refers to one or several existing Items, it is possible to change some of their properties:

Define New ...

This option will make a window appear, allowing the description of the new item.

If the 'Feature Name' entry is filled with the name of an existing feature, the new item is added to the item list of that feature. If the 'Feature Name' does not exist, a new feature is created. In all cases the user is expected to give the item at least a Color, and optionally a Score and a Note.

Delete selected Items

This deletes the selected Items from the current Feature. Note that if all Items of a Feature have been selected, this option deletes the Feature itself.

Propagate Items to this group

The selected Items are propagated to all the sequences of the group they belong to. If an Item to be propagated is already present in one or more sequences of the group, the Item will not be propagated.

Propagate Items to All

This propagates the selected Items to all the sequences of the alignment. If an Item to be propagated is already present in one or more sequences of the alignment, the Item will not be propagated.
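The two propagation options can be sketched with a single Python function that differs only in the list of target sequences (the group, or the whole alignment). The representation of an Item as a (sequence, start, end) tuple and the function name are hypothetical. The sketch also reads the rule above as "skip the sequences that already carry the Item", which is one possible interpretation of the text, not a confirmed description of Ordalie's behaviour.

```python
def propagate(feature_items, selected, target_sequences):
    """Copy each selected item onto the target sequences; return how many were added.

    feature_items: set of (sequence, start, end) tuples for one feature.
    selected: iterable of items, taken from feature_items, to propagate.
    target_sequences: names of the sequences of the group or of the alignment.
    """
    added = 0
    for _, start, end in list(selected):
        for name in target_sequences:
            candidate = (name, start, end)
            # Assumption: a sequence that already has this item is skipped.
            if candidate in feature_items:
                continue
            feature_items.add(candidate)
            added += 1
    return added


items = {("seq1", 10, 20), ("seq3", 10, 20)}
count = propagate(items, [("seq1", 10, 20)], ["seq1", "seq2", "seq3"])
print(count, sorted(items))
```

Calling the function a second time with the same arguments adds nothing, since every target sequence then already carries the Item.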

moumou 2019-03-25