Question

I need this done ASAP please. Thank you.

Problem 2 (30 points) Sequence Databases and BLAST

NCBI, part of the NIH (National Institutes of Health, USA), hosts one of the largest and most comprehensive collections of biological databases. Entrez is the search engine of NCBI, and can be accessed at http://www.ncbi.nlm.nih.gov/. You can use it to search for genes, proteins, genomes, publications and much more. To narrow the results returned, you can restrict your query to a particular database, and/or combine your query terms with field qualifiers and Boolean operators (AND, OR, NOT). See the help page at http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers for all field qualifiers. Even with these qualifiers, you may still get a lot of hits, as some of the database entries are highly redundant, representing essentially the same sequence with different identifiers. For this reason, NCBI has created a sub-database, RefSeq, which contains only non-redundant, highly annotated entries for genomic DNA, transcript (mRNA), and protein sequences.
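As an illustration of how field qualifiers and Boolean operators combine into one query, here is a minimal Python sketch that assembles an Entrez term and the matching E-utilities `esearch` URL. The helper names `build_term` and `esearch_url` are hypothetical, the query is the CD4 example from part A, and nothing is sent over the network; the assignment itself only requires the web interface.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(clauses, operator="AND"):
    """Join (value, field) pairs into one Entrez term.

    Each clause becomes value[FIELD]; clauses are joined with an
    uppercase Boolean operator.
    """
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT")
    return f" {operator} ".join(f"{value}[{field}]" for value, field in clauses)


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an esearch request URL for one database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = build_term([("CD4", "GENE"), ("Homo sapiens", "ORGN")])
url = esearch_url("protein", term)
print(term)
print(url)
```

Restricting `db` to `protein` plays the same role as choosing the Protein database in the web search box.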

A). Search the Protein database to find the sequence for a human gene called CD4, using the search string CD4 [GENE] AND Homo sapiens [ORGN], where [GENE] and [ORGN] are field qualifiers, and Homo sapiens is the scientific name for human. You should see more than 10 entries. Using the filters in the left column, limit your search results to RefSeq. You will see 6 entries, corresponding to various isoforms of this protein. (Check the Wikipedia page on protein isoforms: http://en.wikipedia.org/wiki/Protein_isoform). Click to view details of the longest isoform, isoform 1 precursor. The sequence is displayed in a format called GenBank (or GenPept for protein), with annotations (features) appearing before the actual sequence. For some explanation of the format, see http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html. What is the accession number (a unique identifier) of this sequence? How many amino acids does this protein have? What are the first five amino acids of this protein? Find out how to change the display to FASTA format, which is one of the simplest and most popular formats. (You've seen this in HW1.) Save the sequence in FASTA format to a text file.
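Once the sequence is saved in FASTA format, the three quantities asked for in part A (accession, length, first five residues) can be read off programmatically. The sketch below is one possible way to do it in Python. It assumes the accession is the first whitespace-delimited token of the header line, and the record shown is a made-up toy sequence, not the CD4 entry.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def summarize(header, sequence, n=5):
    """Report accession (assumed to be the first header token), length, and first n residues."""
    return {
        "accession": header.split()[0],
        "length": len(sequence),
        "first": sequence[:n],
    }


# Hypothetical toy record, for illustration only.
example = """>XX_000000.0 hypothetical toy protein [Example organism]
MKTAYIAKQR
QISFVKSHFS
"""

records = parse_fasta(example)
info = summarize(*records[0])
print(info)
```

To use it on the saved file, read the file contents into a string and pass that to `parse_fasta` instead of `example`.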

B). Go to the NCBI homepage and find the link to the BLAST web page, and choose the protein BLAST program. Copy the human CD4 protein sequence (or just its accession number) you just saved to the query window. Pick the RefSeq protein database as the search set. Down at the very bottom, click on Algorithm Parameters, change Max target sequences to the max value, Expect threshold to 1, and the scoring matrix to BLOSUM45, and run BLAST. Which five organisms have protein sequences that are most similar to human CD4? Google to find out the common names of the organisms. Among all the hits, can you find one sequence from the chicken? (The scientific name of chicken is Gallus gallus). Use the chicken-human alignment for the next question.

C). Above the graphic summary section on the result page, you can find a link to the search summary, which shows some statistical parameters used to compute the significance of the alignment. In particular, you can see Lambda and K for gapped alignment (second column), and the size of the database. (The page used to provide effective lengths of query and database, but that function has been removed, unfortunately.) Use these numbers and the chicken-human CD4 alignment to show (1) how to get the bit score from the raw score; and (2) how to compute the E-value of an alignment using both the bit score and the raw score. (Because the effective lengths are not known, the E-value you compute will only be an approximation of what BLAST reports.)
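For reference, the standard Karlin-Altschul relations behind part C are: bit score S' = (Lambda * S - ln K) / ln 2, and E = K * m * n * exp(-Lambda * S) = m * n * 2^(-S'), where S is the raw score, m the query length and n the database size in residues. The Python sketch below evaluates both forms. All numeric values in it are hypothetical placeholders, not taken from any BLAST run, and using the full query and database lengths instead of effective lengths is the approximation the problem statement already acknowledges.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S to a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def evalue_from_raw(raw_score, lam, k, m, n):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def evalue_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Placeholder inputs: replace with Lambda, K, the raw score, the query
# length, and the database size read from your own search summary.
lam, k = 0.25, 0.05
raw, m, n = 200, 450, 10_000_000

bits = bit_score(raw, lam, k)
e_raw = evalue_from_raw(raw, lam, k, m, n)
e_bits = evalue_from_bits(bits, m, n)
print(f"bit score = {bits:.1f}")
print(f"E from raw score = {e_raw:.3g}")
print(f"E from bit score = {e_bits:.3g}")
```

The two E-values agree because substituting the bit-score definition into m * n * 2^(-S') recovers K * m * n * exp(-Lambda * S) exactly.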

D). Go back to the BLAST homepage, under BLAST Genomes, click Human. Copy the human CD4 sequence you just saved to the query window. Select RefSeq RNA as the database. Choose an appropriate program (among the five programs shown at the top of the page, i.e., blastn, blastp, blastx, etc.) to align the protein sequence to the human reference RNA sequences. The top hit should be the corresponding reference mRNA sequence of this protein. What is the accession number of this reference sequence? How many exons are in the human CD4 gene?
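The choice among the five programs in part D comes down to the molecule type of the query and of the database. As a hedged summary, the Python sketch below encodes the usual pairing as a lookup table; the `Program` type and the `candidates` helper are illustrative names, not part of any BLAST tool.

```python
from typing import NamedTuple


class Program(NamedTuple):
    query: str      # molecule type of the query sequence
    database: str   # molecule type of the database sequences
    note: str       # what, if anything, gets translated


BLAST_PROGRAMS = {
    "blastn": Program("nucleotide", "nucleotide", "direct nucleotide comparison"),
    "blastp": Program("protein", "protein", "direct protein comparison"),
    "blastx": Program("nucleotide", "protein", "query translated in six frames"),
    "tblastn": Program("protein", "nucleotide", "database translated in six frames"),
    "tblastx": Program("nucleotide", "nucleotide", "query and database both translated"),
}


def candidates(query_type, database_type):
    """List the programs that accept this query/database combination."""
    return [
        name
        for name, program in BLAST_PROGRAMS.items()
        if program.query == query_type and program.database == database_type
    ]


# Part D pairs a protein query with an RNA (nucleotide) database.
print(candidates("protein", "nucleotide"))
```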

E). Go back to the BLAST homepage and choose the nucleotide BLAST program. Paste the accession number of the reference mRNA sequence you obtained in (D) into the query window. Change the database to RefSeq RNA. At the very bottom of the page, click on Algorithm parameters. Record the following parameters used by the program: word size, Match/Mismatch Scores, Gap costs. Run BLAST. How many hits are found? Save the following information about the 2 least significant hits: sequence accession number, score, E-value, alignment length, percent of identities, and percent of gaps.
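The alignment length, percent identities, and percent gaps requested in part E are all defined over the columns of the pairwise alignment. The following Python sketch shows how they relate, assuming the two aligned rows are given as equal-length strings with `-` marking gaps; the example strings are invented for illustration.

```python
def alignment_stats(query_aln, subject_aln):
    """Compute length, identities, and gaps for one pairwise alignment.

    Percentages are taken over the alignment length (number of columns),
    which is how BLAST reports Identities and Gaps.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned rows must have the same length")
    length = len(query_aln)
    if length == 0:
        raise ValueError("alignment is empty")
    columns = list(zip(query_aln, subject_aln))
    identities = sum(1 for q, s in columns if q == s and q != "-")
    gaps = sum(1 for q, s in columns if q == "-" or s == "-")
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "pct_identity": 100.0 * identities / length,
        "pct_gaps": 100.0 * gaps / length,
    }


# Invented toy alignment: one gap column and one mismatch.
stats = alignment_stats("ACGT-ACGTA", "ACGTTACCTA")
print(stats)
```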

F). Repeat the experiment in (E), but change Program Selection to optimize for somewhat similar sequences. Click on Algorithm parameters and compare the parameters with the ones you recorded in (E). Explain the difference. Run BLAST. How many hits are found this time? Are the least significant hits you found in (E) still in your result? If yes, compare their score, E-value, length, percent identities and gaps to the result in (E) and explain the difference.
