GlycoCT is an encoding scheme for carbohydrate sequences based on a connection table approach. The format adopts IUPAC rules to generate a consistent, machine-readable nomenclature and uses a block concept to describe features of carbohydrate sequences such as repeating units. It has two variants, a condensed format and an XML format. The condensed format allows glycan structures to be identified uniquely in a compact manner.
Monosaccharide names follow the pattern a-bccc-DDD-e:f|g:h, where a is the anomeric configuration (one of a, b, o, x), b is the configurational symbol (one of d, l, x), ccc is the three-letter code for the monosaccharide as listed in Table 1.1, DDD is the base type or superclass indicating the number of consecutive carbon atoms (such as HEX, PEN, NON), e and f are the carbon numbers involved in closing the ring, g is the position of the modifier, and h is the type of modifier. For a, b, e, f and g, an x can be used to specify an unknown value. The bccc and g:h parts may be repeated if necessary.
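The naming pattern above can be checked mechanically. The following is a minimal sketch in Python, not an official GlycoCT parser: the function name `parse_basetype` and the dictionary keys are hypothetical, and the regular expression assumes lowercase three-letter codes and lowercase modifier names, as they appear in the example later in this section.

```python
import re

# Pattern for a-bccc-DDD-e:f|g:h; the bccc and g:h parts may repeat.
# Assumption: stems and modifier types are lowercase, the superclass is uppercase.
BASETYPE = re.compile(
    r"^(?P<anomer>[abox])-"
    r"(?P<stems>(?:[dlx][a-z]{3}-)+)"
    r"(?P<superclass>[A-Z]{3})-"
    r"(?P<ring_start>\d+|x):(?P<ring_end>\d+|x)"
    r"(?P<mods>(?:\|(?:\d+|x):[a-z]+)*)$"
)


def parse_basetype(name):
    """Split a GlycoCT basetype name into its components (hypothetical helper)."""
    m = BASETYPE.match(name)
    if m is None:
        raise ValueError(f"not a recognised basetype name: {name!r}")
    # Each stem is a configuration symbol followed by a three-letter code.
    stems = re.findall(r"([dlx])([a-z]{3})-", m.group("stems"))
    # Each modifier is a position:type pair introduced by '|'.
    mods = [tuple(mod.split(":")) for mod in m.group("mods").split("|") if mod]
    return {
        "anomer": m.group("anomer"),
        "stems": stems,
        "superclass": m.group("superclass"),
        "ring": (m.group("ring_start"), m.group("ring_end")),
        "modifiers": mods,
    }
```

For instance, `parse_basetype("b-dglc-HEX-1:5|6:a")` yields the anomer `b`, one stem `("d", "glc")`, the superclass `HEX`, the ring closure `("1", "5")` and one modifier `("6", "a")`.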
Substituents of monosaccharides are treated as separate residues attached to the base residue. The type of each residue is given by one of the following codes placed immediately after the residue number: b=basetype, s=substituent, r=repeating unit, a=alternative unit. The list of substituents handled by GlycoCT is given in Table 1.2.
GlycoCT is organized similarly to the KCF format: the residues are specified in a RES section and the linkages in a LIN section.
TABLE 1.1: List of monosaccharides and their three-letter codes used in GlycoCT.
Monosaccharide name | Three-letter code | Superclass |
Allose | ALL | HEX |
Altrose | ALT | HEX |
Arabinose | ARA | PEN |
Erythrose | ERY | TET |
Galactose | GAL | HEX |
Glucose | GLC | HEX |
Glyceraldehyde | GRO | TRI |
Gulose | GUL | HEX |
Idose | IDO | HEX |
Lyxose | LYX | PEN |
Mannose | MAN | HEX |
Ribose | RIB | PEN |
Talose | TAL | HEX |
Threose | TRE | TET |
Xylose | XYL | PEN |
TABLE 1.2: List of substituents used in GlycoCT.
acetyl |
amidino |
amino |
anhydro |
bromo |
chloro |
diphospho |
epoxy |
ethanolamine |
ethyl |
fluoro |
formyl |
glycolyl |
hydroxymethyl |
imino |
iodo |
lactone |
methyl |
N-acetyl |
N-alanine |
N-amidino |
N-dimethyl |
N-formyl |
N-glycolyl |
N-methyl |
N-methyl-carbamoyl |
N-succinate |
N-sulfate |
N-triflouroacetyl |
nitrate |
phosphate |
phospho-choline |
phospho-ethanolamine |
pyrophosphate |
pyruvate |
succinate |
sulfate |
thio |
triphosphate |
(r)-1-hydroxyethyl |
(r)-carboxyethyl |
(r)-carboxymethyl |
(r)-lactate |
(r)-pyruvate |
(s)-1-hydroxyethyl |
(s)-carboxyethyl |
(s)-carboxymethyl |
(s)-lactate |
(s)-pyruvate |
(x)-lactate |
(x)-pyruvate |
Example of GlycoCT: A glycan containing repeating units in GlycoCT format.
RES
1r:r1
REP
REP1:8o(4+1)2d=-1--1
RES
2b:b-dgal-HEX-1:5
3s:n-acetyl
4b:b-dglc-HEX-1:5|6:a
5b:b-dgal-HEX-1:5
6s:n-acetyl
7b:a-dgal-HEX-1:5
8b:b-dglc-HEX-1:5
LIN
1:2d(2+1)3n
2:2o(3+1)4d
3:4o(4+1)5d
4:5d(2+1)6n
5:5o(4+1)7d
6:7o(3+1)8d
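The RES and LIN lines of the example above can be read with two regular expressions. This is a minimal sketch, assuming the linkage layout `id:parent-residue, linkage-type(parent-position+child-position)child-residue, linkage-type`, which is inferred from the example rather than stated in the text. The names `parse_glycoct`, `RES_LINE` and `LIN_LINE` are hypothetical, and the REP line is deliberately skipped.

```python
import re

# Residue type codes as listed in the text.
RES_TYPES = {"b": "basetype", "s": "substituent",
             "r": "repeating unit", "a": "alternative unit"}

# RES line: <number><type code>:<name>
RES_LINE = re.compile(r"(\d+)([bsra]):(.+)")
# LIN line (assumed layout): <id>:<parent><type>(<parent pos>+<child pos>)<child><type>
LIN_LINE = re.compile(r"(\d+):(\d+)([a-z])\((-?\d+)\+(-?\d+)\)(\d+)([a-z])")

EXAMPLE = """RES
1r:r1
REP
REP1:8o(4+1)2d=-1--1
RES
2b:b-dgal-HEX-1:5
3s:n-acetyl
4b:b-dglc-HEX-1:5|6:a
5b:b-dgal-HEX-1:5
6s:n-acetyl
7b:a-dgal-HEX-1:5
8b:b-dglc-HEX-1:5
LIN
1:2d(2+1)3n
2:2o(3+1)4d
3:4o(4+1)5d
4:5d(2+1)6n
5:5o(4+1)7d
6:7o(3+1)8d"""


def parse_glycoct(text):
    """Collect residues and linkages; section headers and REP lines are ignored."""
    residues, linkages = {}, []
    for line in text.split():
        res = RES_LINE.fullmatch(line)
        lin = LIN_LINE.fullmatch(line)
        if res:
            residues[int(res.group(1))] = (RES_TYPES[res.group(2)], res.group(3))
        elif lin:
            linkages.append({
                "id": int(lin.group(1)),
                "parent": int(lin.group(2)),
                "parent_type": lin.group(3),
                "parent_pos": int(lin.group(4)),
                "child_pos": int(lin.group(5)),
                "child": int(lin.group(6)),
                "child_type": lin.group(7),
            })
    return residues, linkages
```

Calling `parse_glycoct(EXAMPLE)` gives eight residues (one repeating unit, five basetypes, two substituents) and six linkages.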
IUPAC suggests an extended IUPAC form in which structures are written across multiple lines. This is the format originally used by CarbBank, so it is sometimes referred to as the CarbBank format. The representation of monosaccharides is the same as in the IUPAC format: each monosaccharide residue is preceded by the anomeric descriptor and the configuration symbol, and the ring size is indicated by an italic f or p. If any of α/β, D/L, f/p are omitted, it is assumed that this structural detail is unknown. Branches are written on a second line, or in brackets on the same line.
This format may substitute α and β with a and b, respectively. Arrows (→) may also be replaced by hyphens (-), and up (↑) and down (↓) arrows may be replaced by bars (|).
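The character substitutions just described amount to a simple translation table. The following Python sketch shows one way to apply them; the function name `to_ascii` is a hypothetical choice for illustration.

```python
# Substitutions permitted by the extended IUPAC (CarbBank) format.
ASCII_TABLE = str.maketrans({
    "α": "a",   # alpha anomer
    "β": "b",   # beta anomer
    "→": "-",   # linkage arrow
    "↑": "|",   # branch connector pointing up
    "↓": "|",   # branch connector pointing down
})


def to_ascii(line):
    """Rewrite one line of an extended IUPAC structure using ASCII only."""
    return line.translate(ASCII_TABLE)
```

For example, `to_ascii("α-D-Manp-(1→6)")` returns `a-D-Manp-(1-6)`.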
Example of CarbBank format: The N-glycan core structure represented in CarbBank (extended IUPAC) format.
a-D-Manp-(1-6)+
              |
              b-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-a-D-GlcpNAc
              |
a-D-Manp-(1-3)+
Linear Code® is a carbohydrate format that uses a single-letter nomenclature for monosaccharides and includes a condensed description of the glycosidic linkages. Each monosaccharide is represented by its common structure, and deviations from that common structure are indicated by specific symbols, as follows (Banin et al. (2002)).
Stereoisomers (D or L) differing from the common isomer are indicated by an apostrophe (').
Monosaccharides with differing ring size (furanose or pyranose) from the common form are indicated by a caret (^).
Monosaccharides differing in both of the above are indicated by a tilde (~).
TABLE 1.3: List of common modifications as used in the Linear Code® format.
Modification Type | Linear Code® |
amino | Q |
ethanolaminephosphate | PE |
inositol | IN |
methyl | ME |
N-acetyl | N |
O-acetyl | T |
phosphate | P |
phosphocholine | PC |
pyruvate | PYR |
sulfate | S |
sulfide | SH |
2-aminoethylphosphonic acid | EP |
Example of Linear Code®:
GNb2(Ab4GNb4)Ma3(Ab4GNb2(Fa3(Ab4)GNb6)Ma6)Mb4GNb4GN
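A string such as the one above can be split into residues and branch brackets with a small tokenizer. This Python sketch rests on assumptions that are inferred from the example rather than stated in the text: each residue is a run of uppercase letters, optionally followed by one of the markers `'`, `^` or `~`, and then by an anomer letter (a, b or ?) and a linkage position. The names `TOKEN` and `tokenize_linear_code` are hypothetical.

```python
import re

# One residue (code, optional marker, optional anomer + position) or one bracket.
TOKEN = re.compile(
    r"(?P<res>[A-Z]+)(?P<mark>['^~]?)(?:(?P<anomer>[ab?])(?P<pos>[0-9?]))?"
    r"|(?P<paren>[()])"
)


def tokenize_linear_code(code):
    """Return brackets as strings and residues as (code, marker, anomer, position)."""
    tokens = []
    pos = 0
    while pos < len(code):
        m = TOKEN.match(code, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}: {code[pos]!r}")
        if m.group("paren"):
            tokens.append(m.group("paren"))
        else:
            tokens.append((m.group("res"), m.group("mark"),
                           m.group("anomer"), m.group("pos")))
        pos = m.end()
    return tokens
```

The last residue of the example carries no linkage, so its anomer and position come back as `None`.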
The Bacterial Carbohydrate Structure DataBase (BCSDB) format is used by BCSDB to encode carbohydrates and derivative structures in a single line.
Residues are described in the format <res>(<c1>-<c2>), where res is the name of the residue and its configuration, and c1 and c2 are the carbon numbers of the child and parent, respectively, by which the residue res is linked to its parent.
The portion in parentheses is omitted for the residue at the root. If c1 or c2 is unknown, a question mark (?) may be used. If the glycan structure is a repeated unit, then parts of the portions in parentheses may hang at the ends, as in -2)A(1-3)B(1-4)C(1-, which represents the repeated structure linked by a 1-2 linkage. For branched structures, it is assumed that there is only one main chain, and the rest are branches. Comma-separated side chains are enclosed in square brackets together with their linkage in parentheses.
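To make the linkage convention concrete, the following Python sketch reads an unbranched BCSDB chain into (residue, child carbon, parent carbon) tuples. It is an illustration under stated assumptions, not the BCSDB parser: the function name `parse_bcsdb_chain` is hypothetical, side chains in square brackets are not handled, and hyphens around the parentheses are treated as optional separators.

```python
import re

# A residue followed by its linkage; the parent carbon and the closing
# parenthesis may be missing when the linkage hangs at the end of a repeat.
LINKED = re.compile(r"([^()]+?)\(([0-9?]+)-([0-9?]+)?\)?")


def parse_bcsdb_chain(s):
    """Parse a linear (unbranched) BCSDB string, with or without hanging ends."""
    repeat_parent = None
    lead = re.match(r"-([0-9?]+)\)", s)
    if lead:
        # Leading "-c2)" is the parent carbon of the linkage between repeats.
        repeat_parent = lead.group(1)
        s = s[lead.end():]
    residues = []
    end = 0
    for m in LINKED.finditer(s):
        name, c1, c2 = m.group(1).strip("-"), m.group(2), m.group(3)
        # A hanging "(c1-" at the end pairs with the leading "-c2)".
        residues.append((name, c1, c2 if c2 is not None else repeat_parent))
        end = m.end()
    root = s[end:].strip("-")
    if root:
        # The root residue carries no linkage.
        residues.append((root, None, None))
    return residues
```

Applied to `-2)A(1-3)B(1-4)C(1-`, the hanging `(1-` of C is paired with the leading `-2)`, giving the 1-2 linkage between repeats.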
Example of BCSDB format: Chemical repeating unit of polymer.
a-D-GlcpA-(1-3)-+
|
-3)-a-D-Glcp-(1-4)-b-D-Manp-(1-4)-b-D-Glcp-(1-
The LInear Notation for Unique description of Carbohydrate Sequences (LINUCS) format is based on the extended IUPAC format but uses additional rules to define the priority of the branches. In this way, a carbohydrate structure can be defined uniquely while still containing all the information required to describe it.
The start of LINUCS format may include two square brackets [], followed by the root residue name in square brackets. If a residue has a single child, then the child’s linkage in parentheses surrounded by square brackets precedes the child’s residue name and configuration (as in IUPAC format) in square brackets. If a residue has more than one child, then each child’s branch is surrounded by curly brackets {}. Children are listed in order of the carbon number linking them to the parent, such that the child with a 1-3 linkage would come before a child with a 1-4 linkage.
Example of LINUCS: A glycan structure in LINUCS format.
[][Asn]{
[(4+1)][b-D-GlcpNAc]{
[(4+1)][b-D-GlcpNAc]{
[(4+1)][b-D-Manp]{
[(3+1)][a-D-Manp]{
[(2+1)][a-D-Manp]{
[(2+1)][a-D-Manp]{}
}
}
[(6+1)][a-D-Manp]{
[(3+1)][a-D-Manp]{
[(2+1)][a-D-Manp]{}
}
[(6+1)][a-D-Manp]{
[(2+1)][a-D-Manp]{}
}
}
}
}
}
}
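The branch-ordering rule can be illustrated by generating a LINUCS string from a small tree. This Python sketch follows the bracket layout of the example above, in which every residue is followed by curly brackets. The tree representation (a name plus a list of `(parent carbon, child carbon, subtree)` tuples) and the function name `to_linucs` are assumptions made for illustration.

```python
def to_linucs(residue, linkage=""):
    """Serialise a residue tree, listing children by ascending parent carbon."""
    name, children = residue
    # Priority rule from the text: a child on carbon 3 precedes one on carbon 4.
    ordered = sorted(children, key=lambda child: child[0])
    inner = "".join(
        to_linucs(subtree, f"({parent_c}+{child_c})")
        for parent_c, child_c, subtree in ordered
    )
    # The root has an empty linkage, which produces the leading "[]".
    return f"[{linkage}][{name}]{{{inner}}}"


# Hypothetical input: a mannose bearing two mannose children, listed out of order.
leaf = ("a-D-Manp", [])
branched = ("b-D-Manp", [(6, 1, leaf), (3, 1, leaf)])
```

Here `to_linucs(branched)` emits the 3-linked child before the 6-linked one, whatever their order in the input.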
The KEGG Chemical Function (KCF) format for representing glycan structures was originally used to represent chemical structures (thus the name) in KEGG. KCF uses a graph notation, where nodes are monosaccharides and edges are glycosidic linkages. To represent a glycan, at least three sections are required: ENTRY, NODE and EDGE, followed by three slashes (///) at the end.
- The ENTRY section consists of one line and may specify a name for the structure followed by the keyword Glycan.
- The NODE section consists of several lines. The first line contains the number of monosaccharides or aglycon entities, and the following lines consist of the details of these entities numbered consecutively. For each entity line, the name and x- and y-coordinates (to draw on a 2D plane) must be specified.
- Similarly, the EDGE section consists of several lines, the first line containing the number of bonds (usually one less than the number of NODEs), followed by the details of the bond information. The format for the bond information is as follows:
<num> <donor node#>:<anomeric configuration (a or b)><donor carbon#> <acceptor node#>:<acceptor carbon#>
Example of KCF format: The N-glycan core structure represented in KCF format.
ENTRY XYZ Glycan
NODE 5
1 GlcNAc 15.0 7.0
2 GlcNAc 8.0 7.0
3 Man 1.0 7.0
4 Man -6.0 12.0
5 Man -6.0 2.0
EDGE 4
1 2:b1 1:4
2 3:b1 2:4
3 5:a1 3:3
4 4:a1 3:6
///
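The three sections of the example above map directly onto a small graph structure. The following Python sketch reads them into plain dictionaries and lists. The function name `parse_kcf` and the field names are hypothetical, and the declared NODE and EDGE counts are not validated.

```python
import re

KCF_EXAMPLE = """ENTRY XYZ Glycan
NODE 5
1 GlcNAc 15.0 7.0
2 GlcNAc 8.0 7.0
3 Man 1.0 7.0
4 Man -6.0 12.0
5 Man -6.0 2.0
EDGE 4
1 2:b1 1:4
2 3:b1 2:4
3 5:a1 3:3
4 4:a1 3:6
///"""


def parse_kcf(text):
    """Return (entry fields, nodes by number, list of edges) for one KCF record."""
    entry, nodes, edges = None, {}, []
    section = None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if fields[0] == "///":          # end-of-record marker
            break
        if fields[0] in ("ENTRY", "NODE", "EDGE"):
            section = fields[0]
            if section == "ENTRY":
                entry = fields[1:]
            continue
        if section == "NODE":
            num, name, x, y = fields
            nodes[int(num)] = (name, float(x), float(y))
        elif section == "EDGE":
            _num, donor, acceptor = fields
            d = re.fullmatch(r"(\d+):([ab]?)(\d*)", donor)
            a = re.fullmatch(r"(\d+):(\d*)", acceptor)
            if d is None or a is None:
                raise ValueError(f"malformed EDGE line: {raw!r}")
            edges.append({
                "donor": int(d.group(1)),
                "anomer": d.group(2),
                "donor_carbon": d.group(3),
                "acceptor": int(a.group(1)),
                "acceptor_carbon": a.group(2),
            })
    return entry, nodes, edges
```

Each edge records the donor node with its anomeric configuration and carbon, and the acceptor node with its carbon, which is enough to rebuild the tree.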
Web3 Unique Representation of Carbohydrate Structures (WURCS) is a linear notation for representing carbohydrates for the Semantic Web.
WURCS=2.0/3,5,4/[a2122h-1b_1-5_2*NCC/3=O][a1122h-1b_1-5][a1122h-1a_1-5]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
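The string above can be split into its slash-separated sections with one regular expression. This Python sketch assumes the WURCS 2.0 layout of version, three counts (unique residues, residues, linkages), the bracketed unique residue list, the residue sequence and the linkage list. That layout is not described in the text above, and the names `WURCS_RE` and `split_wurcs` are hypothetical. A plain split on "/" would not work, because a residue descriptor may itself contain a slash.

```python
import re

WURCS_RE = re.compile(
    r"^WURCS=(?P<version>[^/]+)/"
    r"(?P<n_unique>\d+),(?P<n_res>\d+),(?P<n_lin>\d+)/"
    r"(?P<unique>(?:\[[^\]]*\])+)/"      # bracketed descriptors may contain '/'
    r"(?P<sequence>[^/]*)/"
    r"(?P<linkages>.*)$"
)


def split_wurcs(s):
    """Split a WURCS string into its sections (assumed WURCS 2.0 layout)."""
    m = WURCS_RE.match(s)
    if m is None:
        raise ValueError("not a recognised WURCS string")
    return {
        "version": m.group("version"),
        "counts": (int(m.group("n_unique")), int(m.group("n_res")),
                   int(m.group("n_lin"))),
        "unique_residues": re.findall(r"\[([^\]]*)\]", m.group("unique")),
        "sequence": [int(i) for i in m.group("sequence").split("-")],
        "linkages": m.group("linkages").split("_"),
    }


WURCS_EXAMPLE = ("WURCS=2.0/3,5,4/[a2122h-1b_1-5_2*NCC/3=O][a1122h-1b_1-5]"
                 "[a1122h-1a_1-5]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1")
```

For the example, the three counts agree with the lengths of the unique residue list, the residue sequence and the linkage list.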