SEA-PHAGES | Difficulty pulling large-scale data for batch GCS analysis using Pope/Mavrich 2017 scripts

Link to this post \| posted 09 Mar, 2022 01:04
afreise	Hello, I am attempting to do a large-scale GCS comparison of all the phages in Cluster AZ (and plan to do this on other phages as well in the future). My former TA Andrew Kapinos wrote up a set of instructions to accompany the github code for the Pope & Mavrich 2017 batch GCS/MaxGCDGap scripts. I've been able to set up mySQL and other necessary components, and successfully pulled some data, but am stuck again. When I do the data pull from the Actino_Draft database using mySQL, I can't seem to pull the pham and function data that the GCS scripts require for the function.csv file, according to the instructions I have (I left comments in that file to highlight where I got stuck). Can anyone make a recommendation of what I might try next? Kudos to Nick for the help he's provided so far - thought I'd ask here as well so I'm not pestering him too much! (thanks Nick!) Best, Amanda

Link to this post | posted 09 Mar, 2022 01:04

Hello,

I am attempting to do a large-scale GCS comparison of all the phages in Cluster AZ (and plan to do this on other phages as well in the future). My former TA Andrew Kapinos wrote up a set of instructions to accompany the github code for the Pope & Mavrich 2017 batch GCS/MaxGCDGap scripts.

I've been able to set up mySQL and other necessary components, and successfully pulled some data, but am stuck again.

When I do the data pull from the Actino_Draft database using mySQL, I can't seem to pull the pham and function data that the GCS scripts require for the function.csv file, according to the instructions I have (I left comments in that file to highlight where I got stuck). Can anyone make a recommendation of what I might try next?

Kudos to Nick for the help he's provided so far - thought I'd ask here as well so I'm not pestering him too much! (thanks Nick!)

Best,
Amanda

Link to this post \| posted 09 Mar, 2022 15:56
afreise	Update: I've generated the files successfully! I think now I just need to pare down in excel to the only phages I want to compare. Right now gene.csv has all 4000+ phages and genes and I think my computer will explode if I run the GCS script on all of them. Even though I've only listed my subset of AZ phages in genome.csv, I should still need to pare down gene.csv to those phages too, right?

Link to this post \| posted 09 Mar, 2022 16:01
chg60	Hi Amanda, The database schema has changed substantially since that publication, so it's likely the script(s) will need to be altered to accomplish your goal. Since you already have MySQL (and presumably have a local copy of Actino_Draft loaded into MySQL… ), there two paths forward: the easy way and the hard way. I'll lay out the easy way first, and if you like that we don't have to think about the hard way! I'm assuming that your computer runs macOS or Linux. In a Terminal, run these commands: 1. mysql -u root -p Actino_Draft -e "SELECT PhageID, PhamID FROM gene WHERE PhageID IN (SELECT PhageID FROM phage WHERE Cluster = 'AZ' ) ORDER BY PhageID, Stop ASC;" -B -N >> ~/cluster_AZ_phams.tsv 2. pip3 install phamclust 3. phamclust ~/cluster_AZ_phams.tsv ~/cluster_AZ_gcs -m gcs Cluster AZ only has 37 members, so all of this should be done very quickly. Now in your Finder, you can go to the home directory –> cluster_AZ_gcs, and open cluster_1.txt in Excel. Select the first column and split text to columns using delimiters "tab" and "space". You can then apply a color gradient (I like a three-step: red=0, white=35, green=100) to easily visualize the strength of pairwise relationships. Explanation of Terminal steps: 1. exporting PhageIDs and their associated PhamIDs (in ascending order of gene stop coordinate) to a tab-delimited table with no header 2. installing my phamclust package 3. run phamclust using the gcs method ('-m gcs' ) against the table we exported, with results being placed in a directory called 'cluster_AZ_gcs' Edited 09 Mar, 2022 16:03 17Kb

Link to this post | posted 09 Mar, 2022 16:01

chg60

Hi Amanda,

The database schema has changed substantially since that publication, so it's likely the script(s) will need to be altered to accomplish your goal.

Since you already have MySQL (and presumably have a local copy of Actino_Draft loaded into MySQL… ), there two paths forward: the easy way and the hard way. I'll lay out the easy way first, and if you like that we don't have to think about the hard way! I'm assuming that your computer runs macOS or Linux.

In a Terminal, run these commands:

1. mysql -u root -p Actino_Draft -e "SELECT PhageID, PhamID FROM gene WHERE PhageID IN (SELECT PhageID FROM phage WHERE Cluster = 'AZ' ) ORDER BY PhageID, Stop ASC;" -B -N >> ~/cluster_AZ_phams.tsv
2. pip3 install phamclust
3. phamclust ~/cluster_AZ_phams.tsv ~/cluster_AZ_gcs -m gcs

Cluster AZ only has 37 members, so all of this should be done very quickly.

Now in your Finder, you can go to the home directory –> cluster_AZ_gcs, and open cluster_1.txt in Excel. Select the first column and split text to columns using delimiters "tab" and "space". You can then apply a color gradient (I like a three-step: red=0, white=35, green=100) to easily visualize the strength of pairwise relationships.

Explanation of Terminal steps:
1. exporting PhageIDs and their associated PhamIDs (in ascending order of gene stop coordinate) to a tab-delimited table with no header
2. installing my phamclust package
3. run phamclust using the gcs method ('-m gcs' ) against the table we exported, with results being placed in a directory called 'cluster_AZ_gcs'

Edited 09 Mar, 2022 16:03

Link to this post \| posted 09 Mar, 2022 17:32
afreise	WOW - thank you SO MUCH, Christian! It worked perfectly! Definitely easier than what I was attempting. A couple more questions, if/when you have a moment - -Is there a way to do the same analysis but include draft genomes? I want to do a preliminary analysis including drafts just to see where some of the newer AZ phages fall that haven't been manually annotated yet (not for publication, but just out of curiosity) -If I wanted to compare phages from multiple clusters, is that possible using an adjustment to the above script? For example, we are interested in including singleton phage BaileyBlu to the AZ phages, and/or other clusters such as EH.

Link to this post | posted 09 Mar, 2022 17:32

afreise

WOW - thank you SO MUCH, Christian! It worked perfectly! Definitely easier than what I was attempting. smile

A couple more questions, if/when you have a moment -

-Is there a way to do the same analysis but include draft genomes? I want to do a preliminary analysis including drafts just to see where some of the newer AZ phages fall that haven't been manually annotated yet (not for publication, but just out of curiosity)

-If I wanted to compare phages from multiple clusters, is that possible using an adjustment to the above script? For example, we are interested in including singleton phage BaileyBlu to the AZ phages, and/or other clusters such as EH.

Link to this post \| posted 09 Mar, 2022 17:49
chg60	Excellent - glad that worked for you! You can do the same analysis with any subset of genomes from Actino_Draft, the only thing that needs to change is command #1 from above, then run command #3 since you already have phamclust installed. To analyze the things currently labelled as AZ against everything currently in draft status, you'd run something like: mysql -u root -p Actino_Draft -e "SELECT PhageID, PhamID FROM gene WHERE PhageID IN (SELECT PhageID FROM phage WHERE Cluster = 'AZ' or Status = 'draft' ) ORDER BY PhageID, Stop ASC;" -B -N >> ~/cluster_AZ_or_draft_phams.tsv To do the AZ-EH comparison, you'd do something like: mysql -u root -p Actino_Draft -e "SELECT PhageID, PhamID FROM gene WHERE PhageID IN (SELECT PhageID FROM phage WHERE Cluster IN ('AZ', 'EH' ) ORDER BY PhageID, Stop ASC;" -B -N >> ~/cluster_AZ_or_EH_phams.tsv You're basically limited in how creatively you can construct your initial MySQL query. Phamclust's default behavior builds clusters with the 35% GCS threshold and then outputs a single pairwise matrix for each resultant cluster. You can force the creation of an all-versus-all matrix (for example if you are comparing things that are more-or-less unrelated) by adding '-t 0' to command #3. -Christian

Link to this post | posted 09 Mar, 2022 17:49

chg60

Excellent - glad that worked for you!

You can do the same analysis with any subset of genomes from Actino_Draft, the only thing that needs to change is command #1 from above, then run command #3 since you already have phamclust installed.

To analyze the things currently labelled as AZ against everything currently in draft status, you'd run something like:

mysql -u root -p Actino_Draft -e "SELECT PhageID, PhamID FROM gene WHERE PhageID IN (SELECT PhageID FROM phage WHERE Cluster = 'AZ' or Status = 'draft' ) ORDER BY PhageID, Stop ASC;" -B -N >> ~/cluster_AZ_or_draft_phams.tsv

To do the AZ-EH comparison, you'd do something like:

mysql -u root -p Actino_Draft -e "SELECT PhageID, PhamID FROM gene WHERE PhageID IN (SELECT PhageID FROM phage WHERE Cluster IN ('AZ', 'EH' ) ORDER BY PhageID, Stop ASC;" -B -N >> ~/cluster_AZ_or_EH_phams.tsv

You're basically limited in how creatively you can construct your initial MySQL query.

Phamclust's default behavior builds clusters with the 35% GCS threshold and then outputs a single pairwise matrix for each resultant cluster. You can force the creation of an all-versus-all matrix (for example if you are comparing things that are more-or-less unrelated) by adding '-t 0' to command #3.

-Christian

Link to this post \| posted 09 Mar, 2022 17:50
afreise	Oh, I just realized this data pull does include draft genomes - so my first question is answered.

Link to this post \| posted 11 Mar, 2022 19:02
edgingtonn1	Hi Christian, Using a slight modification of your MySQL query statement, I pulled the data for all HostGenus = Arthrobacter and Microbacteria, and then tried the `phamclust -t 0 -m gcs` "-t 0" flag (to set the similarity threshold for a genome to join a cluster to zero). I thought it would produce a single file with an all-by-all matrix, but instead it actually produced even more cluster matrix files than when I used the default "-t 35". For both, I used the same input file. Did I do something wrong? or am I misinterpreting "-t=0"? P.S. Setting it to 100% is interesting. There are seven phamclust 'clusters' (in these phages from these two hosts, wherein all of the members have 100% GSC with each other! Best, -Nick Cohort IV Southern Connecticut State University 501 Crescent St. Department of Biology, SCI 311E New Haven, CT 06515

Link to this post | posted 11 Mar, 2022 19:02

edgingtonn1

Hi Christian,

Using a slight modification of your MySQL query statement, I pulled the data for all HostGenus = Arthrobacter and Microbacteria, and then tried the

phamclust -t 0 -m gcs

"-t 0" flag (to set the similarity threshold for a genome to join a cluster to zero). I thought it would produce a single file with an all-by-all matrix, but instead it actually produced even more cluster matrix files than when I used the default "-t 35". For both, I used the same input file.
Did I do something wrong? or am I misinterpreting "-t=0"?

P.S. Setting it to 100% is interesting. There are seven phamclust 'clusters' (in these phages from these two hosts, wherein all of the members have 100% GSC with each other!

Best,
-Nick

Cohort IV
Southern Connecticut State University
501 Crescent St.
Department of Biology, SCI 311E
New Haven, CT 06515

Link to this post \| posted 11 Mar, 2022 21:43
chg60	Hi Nick, Your output using '-t 0' probably indicates that there's a bug that I need to work out. Setting a clustering threshold of 0% similarity should result in one massive cluster, regardless of the clustering algorithm… I'll look into it and get back to you. Your '-t 100' experiment is interesting! I knew there were some groups of VERY homogeneous phages, but didn't think there were 7 groups of 100 percenters just within that host group! I'd be curious the results if you don't use GCS, but rather the default 'pocp' method of calculating similarity. The 'pocp' math is very similar to GCS, but deals with the (rare) instances of paralogs more sensibly than GCS. Best, Christian Edited 11 Mar, 2022 21:44

Link to this post | posted 11 Mar, 2022 21:43

chg60

Hi Nick,

Your output using '-t 0' probably indicates that there's a bug that I need to work out. Setting a clustering threshold of 0% similarity should result in one massive cluster, regardless of the clustering algorithm… I'll look into it and get back to you.

Your '-t 100' experiment is interesting! I knew there were some groups of VERY homogeneous phages, but didn't think there were 7 groups of 100 percenters just within that host group! I'd be curious the results if you don't use GCS, but rather the default 'pocp' method of calculating similarity. The 'pocp' math is very similar to GCS, but deals with the (rare) instances of paralogs more sensibly than GCS.

Best,

Christian

Edited 11 Mar, 2022 21:44

Link to this post \| posted 11 Mar, 2022 22:24
chg60	Ok, I found the bug. The clustering algorithm I wrote doesn't require that you provide an explicit clustering threshold - if you don't provide one, it tries to infer one by studying the matrix. However, in the if-block that checks whether the function was given a threshold, I didn't anticipate the need to differentiate between 0 and None (`if not threshold:`). If you run: pip install –upgrade phamclust that should bring phamclust up to version 0.1.2, which fixes the issue. Thanks for reporting that unexpected behavior, Nick! -Christian

Link to this post | posted 11 Mar, 2022 22:24

chg60

Ok, I found the bug. The clustering algorithm I wrote doesn't require that you provide an explicit clustering threshold - if you don't provide one, it tries to infer one by studying the matrix. However, in the if-block that checks whether the function was given a threshold, I didn't anticipate the need to differentiate between 0 and None (`if not threshold:`).

If you run:

pip install –upgrade phamclust

that should bring phamclust up to version 0.1.2, which fixes the issue.

Thanks for reporting that unexpected behavior, Nick!

-Christian

Link to this post \| posted 10 May, 2022 17:02
afreise	Hi Christian, I think I may be running into the same bug (?) when I try to compare multiple clusters. Right now I'm trying to compare between clusters CA, CB, CC and CE. when I run the following commands, 1. mysql -u root -p Actino_Draft -e "SELECT PhageID, PhamID FROM gene WHERE PhageID IN (SELECT PhageID FROM phage WHERE Cluster IN ('CA', 'CB', 'CC', 'CE' )) ORDER BY PhageID, Stop ASC;" -B -N >> ~/Rhodo_phams.tsv 2. phamclust ~/Rhodo_phams.tsv ~/Rhodo_gcs -m gcs I get multiple output text files, one of which has all the CA phages compared to each other, and the multiple other ones just have one phage each in them compared to itself. I tried running the upgrade command pip install –upgrade phamclust but it gave me the error: ERROR: Invalid requirement: '–upgrade' Can you advise re: the upgrade command? Should I be entering it after a specific other command? I've tried a couple ways. Best, Amanda

Link to this post | posted 10 May, 2022 17:02

afreise

Hi Christian,

I think I may be running into the same bug (?) when I try to compare multiple clusters. Right now I'm trying to compare between clusters CA, CB, CC and CE. when I run the following commands,

1. mysql -u root -p Actino_Draft -e "SELECT PhageID, PhamID FROM gene WHERE PhageID IN (SELECT PhageID FROM phage WHERE Cluster IN ('CA', 'CB', 'CC', 'CE' )) ORDER BY PhageID, Stop ASC;" -B -N >> ~/Rhodo_phams.tsv

2. phamclust ~/Rhodo_phams.tsv ~/Rhodo_gcs -m gcs

I get multiple output text files, one of which has all the CA phages compared to each other, and the multiple other ones just have one phage each in them compared to itself.
I tried running the upgrade command

pip install –upgrade phamclust

but it gave me the error:
ERROR: Invalid requirement: '–upgrade'
Can you advise re: the upgrade command? Should I be entering it after a specific other command? I've tried a couple ways.

Best,
Amanda

Recent Activity

Difficulty pulling large-scale data for batch GCS analysis using Pope/Mavrich 2017 scripts