
SCIENTIFIC JOURNAL OF HANOI METROPOLITAN UNIVERSITY − VOL.62/2022


USING SOME BIOINFORMATICS TOOLS TO MINE GENES
ENCODING CELLOBIOHYDROLASE FROM METAGENOME DATA
OF THE BACTERIA SURROUNDING WHITE-ROT FUNGI
(Trametes versicolor) IN CUC PHUONG NATIONAL PARK
Nguyen Thi Binh(1*), Le Thi Thu Hong(2), Truong Nam Hai(2)
(1) Hanoi Metropolitan University
(2) Academy of Science and Technology, Vietnam Academy of Science and Technology

Abstract: Cellobiohydrolase (EC 3.2.1.91) is one of the important enzymes involved in
cellulose hydrolysis. In this study, gene sequences encoding cellobiohydrolase were
mined from metagenomic DNA data of microorganisms surrounding white-rot fungi in
Cuc Phuong National Park, based on the KEGG database. In total, 73 ORFs encoding
cellobiohydrolase were obtained, of which 15 ORFs contained complete genes and 6 ORFs had
functional regions. Protein expression levels in E. coli were estimated with the Periscope
software, which predicted that gene GL0212614 had the highest expression level, 742
mg/l. The secondary and tertiary structures of GL0212614 were predicted with Phyre2; the
structural model was built on the c3nfvA template with 46% coverage
and 100% confidence. The predicted secondary structure comprised 25% α helix, 29% β helix, 2% TM
helix and 14% unidentified. GL0212614 is predicted to be an acidic enzyme, with an optimal
temperature for enzyme activity of 55°C-65°C. These results are an important basis for choosing gene
expression conditions.
Keywords: Bioinformatics, cellobiohydrolase, DNA metagenome, E. coli, expression level.


Received 10 May 2022
Revised and accepted for publication 26 July 2022
(*) Email:

1. INTRODUCTION
Cellulose is one of the most important and abundant forms of biomass. Effective degradation
of this biomass requires the participation of three cellulase enzymes: endo-1,4-β-D-glucanase
(endocellulase, EC 3.2.1.4), exo-1,4-β-D-glucanase (exocellulase or cellobiohydrolase, EC
3.2.1.91) and β-glucosidase (cellobiose hydrolase, EC 3.2.1.21). Endoglucanases, or
endocellulases, cleave at random points within the cellulose chain, producing
oligosaccharides of variable size. Exocellulases, or cellobiohydrolases, act on the terminal ends



of oligosaccharide chains produced by endocellulases, cleaving glycosidic bonds and releasing
glucose or cellobiose [1]. The enzyme β-glucosidase is responsible for breaking down
cellobiose into glucose molecules. Of these three groups of enzymes, cellobiohydrolase is an
important component of the cellulase system and plays a major role in biofuel production from
plant biomass [2]. Cellobiohydrolase is usually produced by fungi, but many bacteria also
carry genes encoding this enzyme. Because microorganisms have a rather specialized
cellulosomal system, many studies have sought to exploit their genes encoding
cellobiohydrolase, such as the gene encoding cellobiohydrolase from Clostridium
clariflavum [2], the gene HmCel6A and its variant HmCel6A-3SNP from bacteria in a
hot spring area [3], and the gene Cel6A from Penicillium [4].
Soil is a promising ecosystem with abundant and diverse microorganisms. It is considered an
important source of new enzymes with high efficiency in cellulose degradation [5],
especially the area surrounding white-rot fungi. White-rot fungi can effectively metabolize all
the components of wood. This hydrolysis by white-rot fungi is often associated with
enzymes of bacteria living in the same ecosystem. As the fungi decompose wood,
redox reactions occur that acidify the environment; in addition, the fungi
release secondary metabolic products into the environment. Bacteria
that survive under these conditions must therefore have properties suited to this environment.
To efficiently exploit genes from microorganisms in different ecosystems, metagenomics
techniques have been used to search for new genes from unculturable microorganisms. Gene
sequencing yields very large metagenome data sets. To exploit these data efficiently, bioinformatics
tools are used to screen and predict candidate genes encoding proteins of interest before
experimental studies are conducted. In this study, we show how some bioinformatics tools can be used
to mine new cellobiohydrolase genes from metagenomic DNA data of microorganisms
surrounding white-rot fungi in Cuc Phuong National Park.

2. MATERIALS AND METHODS
Resources: The 51.8 Gb of metagenomic DNA data came from a microbial sample collected
around wood-degrading white-rot fungi (T. versicolor) in the Cuc Phuong rainforest
and was sequenced on the Illumina HiSeq sequencing system (Illumina, San Diego, USA) at
BGI, Hong Kong.
Research Methods
Prediction of ORFs using MetaGene Annotator (MGA) software: The 51.8 Gb of
metagenomic DNA sequence data were assembled with the IDBA software, which joined the short
sequences into 2,611,883 contigs with a mean length of 898 bp. From these contigs,
4,104,872 ORFs were identified using the MGA software. These ORFs
were then compared with the KEGG (Kyoto Encyclopedia of Genes and Genomes) database to find
the ORF sequences encoding cellobiohydrolase.
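
The KEGG comparison step amounts to keeping the ORFs whose annotation carries the cellobiohydrolase EC number (3.2.1.91). A minimal Python sketch of such a filter follows. The tab-separated layout (ORF identifier, then a comma-separated list of EC numbers), the function name and the example rows are all hypothetical; they are not the output format of MGA or KEGG.

```python
import csv
import io

CBH_EC = "3.2.1.91"  # cellobiohydrolase EC number given in the text


def select_orfs_by_ec(handle, ec_number=CBH_EC):
    """Return the IDs of ORFs whose annotation lists the given EC number.

    Assumes a hypothetical tab-separated layout: orf_id <TAB> ec1,ec2,...
    """
    selected = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank lines and ORFs without an annotation
        orf_id, ec_field = row[0], row[1]
        # Exact matching avoids confusing 3.2.1.91 with e.g. 3.2.1.9
        ec_numbers = {ec.strip() for ec in ec_field.split(",")}
        if ec_number in ec_numbers:
            selected.append(orf_id)
    return selected


# Made-up annotation rows, for illustration only
example = io.StringIO(
    "ORF_A\t3.2.1.4\n"
    "ORF_B\t3.2.1.91\n"
    "ORF_C\t3.2.1.21,3.2.1.91\n"
    "ORF_D\n"
)
hits = select_orfs_by_ec(example)
print(hits)
```

Matching whole EC numbers rather than substrings keeps neighbouring enzyme classes from being selected by accident.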



Prediction of functional regions of ORFs using Pfam and HMMER: Pfam
is a database comprising a large collection of protein families and domains. To
predict the functional regions of an ORF with Pfam, we submitted the protein sequences with an e-value
of 1.0 and a personal e-mail address and confirmed the submission via the HMMER website;
the results were returned after 2-3 days. HMMER is online software that quickly predicts the
functional regions of proteins in Pfam, based on a representative HMM model.
Prediction of protein expression level inferred from ORFs using Periscope software:
Protein expression levels in E. coli cells were predicted using the Periscope software.
Periscope classifies the expression levels of
soluble proteins into three levels (high, moderate and low) and also predicts
the amount of soluble protein in mg/l.
Predicting the spatial structure of proteins: Phyre2 software
( ?id=index) was used. To predict the
higher-order structure of a protein, the user submits the protein sequence; the software then determines the
secondary and tertiary structure of the models, the domain composition and the model quality of the
protein. The structure prediction results are returned to the sender's e-mail.
Prediction of some physical properties of proteins: The AcalPred software was used
to predict whether a protein is acidic or alkaline. The user enters the
target protein sequence into the search box, and the software returns the acidic or
alkaline classification within a few minutes. The TBI software was used to
predict the optimum temperature of enzyme activity. The input to TBI is the amino acid
sequence, and the results are available within a few minutes.

3. RESULTS AND DISCUSSION
3.1. Prediction of ORFs encoding the enzyme cellobiohydrolase
Based on the KEGG database and using the MGA software, 73 ORFs were predicted
to encode cellobiohydrolase. Of these, 15 ORFs (20.55%) contained the entire gene (complete
gene); of the remainder, 11 ORFs lacked the 3' end, 5 ORFs lacked the 5' end and 42 ORFs lacked both the 5'
and 3' ends. In the analysis of genes encoding cellobiohydrolase, we prioritized
complete ORFs for further analysis.
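
The four completeness categories can be illustrated with a simple rule: an ORF is treated as complete when it begins with a start codon and ends with a stop codon. The Python sketch below is only an approximation under that assumption (the codon sets and the function name are illustrative, and MGA's own criteria may differ). It ends with a check of the counts and percentage reported above.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # common bacterial start codons (assumption)
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_orf(seq):
    """Label an ORF by which ends are present, judged only from its first and last codon."""
    seq = seq.upper()
    has_start = seq[:3] in START_CODONS
    has_stop = len(seq) >= 6 and seq[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_start:
        return "lacks 3' end"
    if has_stop:
        return "lacks 5' end"
    return "lacks both ends"


# Counts reported in the text for the 73 cellobiohydrolase ORFs
counts = {"complete": 15, "lacks 3' end": 11, "lacks 5' end": 5, "lacks both ends": 42}
total = sum(counts.values())
complete_pct = round(100 * counts["complete"] / total, 2)
print(total, complete_pct)  # 73 20.55
```

The four reported categories sum to 73, and 15 of 73 rounds to the 20.55% given in the text.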
3.2. Analysis of functional regions of ORF
Proteins usually consist of one or more functional regions called domains. Searching
for the domains present in a protein therefore provides insight into its function. To evaluate
the function of the enzymes, we searched for domains in the 15 complete ORFs. Of these,
6 ORFs had functional domains: 1 ORF had an Alginate_lyase domain, 1 ORF had an Amidase 3
domain, 1 ORF had a CBM2 domain, 1 ORF had a CBP_BcsO domain, 1 ORF had a GH128 +
Laminin G3 domain and 1 ORF had a Znribbon 8 domain (the domains of the genes are shown in Figure 1).
These domains are involved in the function of the genes. Therefore, the subsequent predictions
were carried out on the 6 complete ORFs with defined functional regions.



Figure 1. Diagram showing the functional domain of ORFs.
3.3. Prediction of expression levels of genes encoding the enzyme cellobiohydrolase
E. coli is considered to be the most popular recombinant protein expression system today.
Expressing proteins in soluble form in E. coli not only facilitates purification of the target protein, but also improves the
chance of obtaining structurally intact and biologically active protein. The expression level of
soluble proteins was predicted with the Periscope software for the 6 complete ORFs
with identified functional regions. The predicted
expression levels of the 6 ORFs are shown in Table 1.
Table 1. Predicted expression level of cellobiohydrolase genes in E. coli
No   Gene code    Domain               Expression level (mg/l)
1    GL0212614    Alginate_lyase       742.5445
2    GL0221923    Amidase3             9.2954
3    GL2034110    CBM2                 9.1483
4    GL0879211    CBP_BcsO             15.3828
5    GL0058533    GH128+ Laminin G3    13.5843
6    GL0733968    Znribbon             0.1945


The predictions showed that gene GL0212614, containing the
Alginate_lyase domain, had the highest expression level, 742 mg/l. The remaining genes
all had low predicted expression levels, which would make further expression studies difficult. Therefore,
gene GL0212614 was selected for property prediction before further experiments. The
nucleotide and amino acid sequences of GL0212614 are shown in Figure 2.
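
The selection step is a ranking of the predicted values in Table 1. The Python sketch below uses the six values from the table, written with decimal points; the variable names are illustrative.

```python
# Predicted soluble expression levels (mg/l) for the six ORFs in Table 1
levels = {
    "GL0212614": 742.5445,
    "GL0221923": 9.2954,
    "GL2034110": 9.1483,
    "GL0879211": 15.3828,
    "GL0058533": 13.5843,
    "GL0733968": 0.1945,
}

# Sort genes from highest to lowest predicted expression level
ranked = sorted(levels.items(), key=lambda item: item[1], reverse=True)
best_gene, best_level = ranked[0]

for gene, level in ranked:
    print(f"{gene}\t{level:.4f}")
print("Selected:", best_gene)
```

The top-ranked gene is GL0212614, and the next highest value (GL0879211, about 15.4 mg/l) is far lower, which is why only GL0212614 was carried forward.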
atg aaa gta att gtt ttc ctg att tta atg gtg gtt cta aac agc tgt tct ttg gct ttt
M K V I V F L I L M V V L N S C S L A F
gcc caa tca ttt gtt cat ccg ggt gga tta cat acc ctc gcc gac tta aac cga atg aaa
A Q S F V H P G G L H T L A D L N R M K



gat atg gtg aag aag cgg gcg cat cca tgg ata gac agt tgg aac aaa ctt atc caa gat
D M V K K R A H P W I D S W N K L I Q D
cca ctt gca caa aac acc tat aca gct gca ccc aag gca aat atg ggc gat agt cgg cag
P L A Q N T Y T A A P K A N M G D S R Q

cgt gca tca acc gat gcg cac gcg gct tat ttg aat gcc ata cgc tgg tac atc aca ggt
R A S T D A H A A Y L N A I R W Y I T G
gat cgc agt tat ggg gat tgt gcg att tcc atc tgt aac gca tgg tcc ggc acc gtt gat
D R S Y G D C A I S I C N A W S G T V D
cga gtg cca tca ggt gta gac att ccc gga ctg agt gga atc gct atc gct gag ttt gca
R V P S G V D I P G L S G I A I A E F A
ttg gcc gca gaa gta ctt cgg ctg aat gaa cgg tgg gaa atc gat gaa att agg cgt ttt
L A A E V L R L N E R W E I D E I R R F
aaa acc atg atg act acc tat ttt tat ccg gtt tgc cat gat ttc ttg acg aac cat gct
K T M M T T Y F Y P V C H D F L T N H A
gga agg tgt gcc gat tat ttt tgg gca aac tgg gat gcc tgt aat ata gct gca tta att
G R C A D Y F W A N W D A C N I A A L I
gct atg ggt gta ctt tgc gat gat cgg aat att tat gac gaa gga gtt gaa tat ttt aaa
A M G V L C D D R N I Y D E G V E Y F K
cac gga gat ggc gcc ggc agc atc gaa cac gcc gtt gcc tac att cat tcc ggt aat ctc
H G D G A G S I E H A V A Y I H S G N L
ggg caa tgg cag gaa agc ggc agg gat cag gaa cat gca cag tta gga gtg gga ctt ttg
G Q W Q E S G R D Q E H A Q L G V G L L
gct gca gcc tgt cag gtt gcg tgg aat cag gga ttg gac cta ttc agt tat gat aat aac
A A A C Q V A W N Q G L D L F S Y D N N
cgg ctt ctt gct ggt gcc gaa tat gta gca aaa tat aac cta tgg cag gat gta cct ttt
R L L A G A E Y V A K Y N L W Q D V P F
aaa tat tat aac agc tgc cag cag gta aac cat aat tgg tca tct att aat gga agg gga
K Y Y N S C Q Q V N H N W S S I N G R G
agg ttg gat gat cgc ccg ctt tgg gag tta att tac aat cat tat gtc gtt aga aaa agg
R L D D R P L W E L I Y N H Y V V R K R
ttg aac gca cct aat tca aaa tta atg gct gaa ctc atg aga ccc gag cat ggc agt aac
L N A P N S K L M A E L M R P E H G S N
gat cat ttt gga tac ggt aca ctg aca ttt acg ttg gat gga aag cag tca ccc tat cct



D H F G Y G T L T F T L D G K Q S P Y P
gca ctt gca aca cca gcc att ccg acc cat ctg act gct aca gca ggt gta aat aga gta
A L A T P A I P T H L T A T A G V N R V
tat ctc aca tgg cat cca tct gaa gga tat act gcg cag gga tat gag gtg caa cgg gct
Y L T W H P S E G Y T A Q G Y E V Q R A
ata agt agc gcc ggt cct tat aac atc att acc aaa tgg aat gat cat aca tca cca caa
I S S A G P Y N I I T K W N D H T S P Q
tat ata gat ccg gat gta aca aat gga aca aat tac tac tac cgg gtg gcg gca ttg aac
Y I D P D V T N G T N Y Y Y R V A A L N
caa tca ggt act agt tcg tat tct tcc att gtc cag gcc agt cct cag gct gca gga gaa
Q S G T S S Y S S I V Q A S P Q A A G E
ctt cct gcg aaa tgg aaa aat aca tta atc ggg aaa gga aat gat ggc aat gcc gct ttt
L P A K W K N T L I G K G N D G N A A F
gct gcc gtt ggc gaa gga acc ttt att gtt aaa gga aac gga act gat ctc gga gga aat
A A V G E G T F I V K G N G T D L G G N
gaa gat caa ata acc tat act tac tgt cgt gta gaa gga gat ttt gtg atc acc gca aga
E D Q I T Y T Y C R V E G D F V I T A R
att tcg gat att act ggg cct aat cag aaa aca ggg ata atg gtt agg gaa tcg ctg gct
I S D I T G P N Q K T G I M V R E S L A
gca gac gcg aaa gca gtg agc ata acc ttg gga gat gca ggc gga cgt ttt gcc cga atg
A D A K A V S I T L G D A G G R F A R M
ggc aaa cgt aaa aat gac aaa gaa aaa atg tct ttt aca ttg gga aac gct tat aca tgg
G K R K N D K E K M S F T L G N A Y T W

ttg ccg gcg tgg ttc agg tta gaa cgg act gga agc tct tat aaa gca ttt gaa tct tcc
L P A W F R L E R T G S S Y K A F E S S
gat ggg acg cat tgg ttt aag gtt tct act gaa aac ttc agc atg tca aaa aca gca ttt
D G T H W F K V S T E N F S M S K T A F
gtc gga ttg gtt gtt gct tca ggt agt gcg tca gga ata gat act gtc acc ttc gat cat
V G L V V A S G S A S G I D T V T F D H
gta aag atc acc aaa agt act aat tct ggc aaa caa ggc gaa tga
V K I T K S T N S G K Q G E
Figure 2. Gene sequence and amino acid sequence of the gene GL0212614
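
The codon-by-codon translation shown in Figure 2 can be reproduced with the standard genetic code. The Python sketch below is a minimal illustration, not the tool used in the study; the function and variable names are assumptions. It translates the first and last nucleotide lines of the figure.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string (spaces allowed) and stop at the first stop codon."""
    dna = "".join(dna.split()).upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


first_line = ("atg aaa gta att gtt ttc ctg att tta atg "
              "gtg gtt cta aac agc tgt tct ttg gct ttt")
last_line = "gta aag atc acc aaa agt act aat tct ggc aaa caa ggc gaa tga"

print(translate(first_line))  # MKVIVFLILMVVLNSCSLAF
print(translate(last_line))   # VKITKSTNSGKQGE
```

Both outputs match the amino acid lines printed under the corresponding codons in Figure 2, and the final codon tga is the stop codon that ends the complete ORF.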



3.4. Predicting the spatial structure and some physical properties of proteins
Since protein structures tend to be more conserved than their amino acid
sequences during evolution, we predicted the spatial structure of the GL0212614 protein
using the Phyre2 software. The results showed that the structural model of
GL0212614, built on the alginate lyase template c3nfvA_ from
Bacteroides ovatus, had a coverage of 46% and a confidence level of 100% (Figure 3). The predicted
secondary structure comprised 25% α helix, 29% β helix, 2% TM helix and 14% unidentified.

Figure 3. Structural model of the GL0212614 protein predicted using Phyre2
Some physical properties of the GL0212614 protein were also predicted. When the
amino acid sequence was submitted to the AcalPred software, the acidic and alkaline indices were
0.919904 and 0.080096, respectively. According to this tool, the selected gene
encodes an acidic enzyme. This result is consistent with a previous study showing that cellobiohydrolase
is active under acidic pH conditions [4]. TBI reports the optimal temperature for enzyme activity
at three levels: above 65°C, 55°C-65°C and below 55°C. The predicted melting
temperature (Tm) index of GL0212614 was 0.8289, so the optimal temperature for enzyme
activity is predicted to be 55°C-65°C. This result will help us choose suitable temperature and pH conditions
in future studies.
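
The two property calls can be sketched as simple decision rules. In the Python sketch below, the acid/alkaline call takes the larger of the two reported indices. The cut-offs of 0 and 1 for the Tm index are assumptions about how the index maps onto the three temperature bands and are not stated in the text. The middle band is taken as 55°C-65°C, as in the abstract and conclusion. Function names are illustrative.

```python
def classify_ph(acidic_index, alkaline_index):
    """Call the protein acidic or alkaline according to the larger index."""
    return "acidic" if acidic_index >= alkaline_index else "alkaline"


def tm_band(tm_index):
    """Map a Tm index to a temperature band (cut-offs 0 and 1 are assumptions)."""
    if tm_index > 1.0:
        return "above 65°C"
    if tm_index >= 0.0:
        return "55°C-65°C"
    return "below 55°C"


# Values reported for GL0212614
print(classify_ph(0.919904, 0.080096))  # acidic
print(tm_band(0.8289))                  # 55°C-65°C
```

Under these assumed cut-offs, the reported index of 0.8289 falls in the middle band.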

4. CONCLUSION
We mined 15 complete genes encoding cellobiohydrolase from the
metagenome data of microorganisms surrounding white-rot fungi in Cuc Phuong National Park. Of these, 6 genes
had functional regions. Gene GL0212614, with the functional region Alginate_lyase, was the
gene with the highest predicted expression level, 742 mg/l. The structure of GL0212614 was modeled
on the alginate lyase template c3nfvA_ from Bacteroides ovatus with 46%
coverage and 100% confidence. The predicted secondary structure comprised 25% α helix, 29% β
helix, 2% TM helix and 14% unidentified. GL0212614 is predicted to be an acidic protein, and the optimal
temperature for enzyme activity is 55°C-65°C. These results are an important basis for
choosing gene expression conditions.



Acknowledgments: This study was supported by a grant from the Bilateral
International Project, code NĐT.50.GER/18, from the Ministry of Science and Technology
(MOST), Vietnam, and the Federal Ministry of Education and Research, Germany, and used the
facilities of the National Key Laboratory of Gene Technology, Institute of Biotechnology, Vietnam
Academy of Science and Technology (VAST), Vietnam.
REFERENCES
1. F. L. Soares Júnior et al. (2013), “Endo- and exoglucanase activities in bacteria from mangrove
sediment,” Brazilian J. Microbiol., vol. 44, no. 3, p. 969, doi: 10.1590/S1517-83822013000300048.
2. A. Zafar et al. (2021), “Efficient biomass saccharification using a novel cellobiohydrolase from
Clostridium clariflavum for utilization in biofuel industry,” RSC Adv., vol. 11, no. 16, pp. 9246–9261,
doi: 10.1039/D1RA00545F.
3. M. Takeda et al. (2022), “Metagenomic mining and structure-function studies of a hyperthermostable
cellobiohydrolase from hot spring sediment,” Commun. Biol., vol. 5, no. 1, pp. 1–11,
doi: 10.1038/s42003-022-03195-1.
4. L. Gao, F. Wang, F. Gao, L. Wang, J. Zhao, and Y. Qu (2011), “Purification and characterization of
a novel cellobiohydrolase (PdCel6A) from Penicillium decumbens JU-A10 for bioethanol
production,” Bioresour. Technol., vol. 102, no. 17, pp. 8339–8342,
doi: 10.1016/J.BIORTECH.2011.06.033.
5. T.-T.-H. Le et al. (2022), “De Novo Metagenomic Analysis of Microbial Community Contributing
in Lignocellulose Degradation in Humus Samples Harvested from Cuc Phuong Tropical Forest in
Vietnam,” Diversity, vol. 14, no. 3, p. 220,
doi: 10.3390/D14030220.

USING SOME BIOINFORMATICS TOOLS TO MINE GENES
ENCODING THE ENZYME CELLOBIOHYDROLASE FROM
METAGENOME DATA OF THE BACTERIAL COMMUNITY AROUND WHITE-ROT
FUNGI (Trametes versicolor) IN CUC PHUONG NATIONAL PARK
Summary: Cellobiohydrolase (EC 3.2.1.91) is one of the important enzymes involved
in cellulose hydrolysis. In this study, gene sequences encoding
cellobiohydrolase were mined from metagenomic DNA data of microorganisms around
white-rot fungi in Cuc Phuong National Park, based on the KEGG database. In total, 73 ORFs
encoding cellobiohydrolase were obtained, of which 15 ORFs contained complete genes and 6
ORFs had functional regions. Protein expression levels in E. coli, estimated with
the Periscope software, showed that gene GL0212614 had the highest expression level, 742 mg/l.
The secondary and tertiary structures of GL0212614 predicted with Phyre2 showed that the structure of GL0212614
was determined based on the c3nfvA template with 46% coverage and 100% confidence.
The secondary structure of GL0212614 comprised 25% α helix, 29% β helix, 2% TM helix and 14%
unidentified. GL0212614 is an acid-tolerant gene, and the optimal temperature for enzyme activity is
55°C-65°C. These results are an important basis for choosing gene expression conditions.
Keywords: Cellobiohydrolase, DNA metagenome, E. coli, expression level, bioinformatics.



