-
Notifications
You must be signed in to change notification settings - Fork 79
LD matrix documentation #3353
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
LD matrix documentation #3353
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3353 +/- ##
=======================================
Coverage 89.76% 89.76%
=======================================
Files 29 29
Lines 31292 31292
Branches 5738 5738
=======================================
Hits 28089 28089
Misses 1794 1794
Partials 1409 1409
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
| subset of sites, either by specifying a single vector for both rows and columns | ||
| or a pair of vectors for the row sites and column sites separately. | ||
|
|
||
| The following example produces a matrix containing {math}`r^2` computed pairwise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| The following example produces a matrix containing {math}`r^2` computed pairwise | |
| The following computes a matrix of {math}`r^2` measure of linkage-disequilibrium (LD) pairwise |
| The two-locus API provides a mechanism by which to subset the samples under | ||
| consideration, providing the ability to compute a separate LD matrix for each | ||
| sample set or an LD matrix between sample sets. Output dimensions are handled in | ||
| the same manner as the rest of the stats api (see |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| the same manner as the rest of the stats api (see | |
| the same manner as the rest of the stats API (see |
| : {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{ab} - p_{a}p_{b}` | ||
|
|
||
| This statistic is inherently polarised, as the unpolarised result of this | ||
| statistic is expected to be zero. Uses the `total` normalisation method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expectation is zero when unpolarised, but actual values can differ from zero, particularly in small populations or under intense selection, no?
| :param list sample_sets: A list of lists of Node IDs, specifying the | ||
| groups of nodes to compute the statistic with. | ||
| Defaults to all samples grouped by population. TODO: does it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lingering TODO
|
I spotted minor bits for you to check/consider. Someone more versed in the tskit code should also have a look;) |
|
Thank you @lkirk! I’ll give detailed comments on it later this evening. Great to see your methods getting documentation |
apragsdale
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Lloyd - many small, minor comments here that will hopefully help with documentation clarity in places. Thanks again!
| The {meth}`~TreeSequence.ld_matrix` method provides an interface to a collection | ||
| of two-locus statistics with predefined summary functions (see | ||
| {ref}`sec_stats_two_locus_summary_functions`) and `site` and `branch` {ref}`modes | ||
| <sec_stats_mode>`. This method differs from the other stats methods because it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The API for this method differs..."
| of two-locus statistics with predefined summary functions (see | ||
| {ref}`sec_stats_two_locus_summary_functions`) and `site` and `branch` {ref}`modes | ||
| <sec_stats_mode>`. This method differs from the other stats methods because it | ||
| provides a collection of statistics, instead of the usual one method per |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, "The LD matrix method differs from other statistics methods in that it provides a unified API with an argument to specify different two-locus summaries of the data." Or something along those lines.
| <sec_stats_mode>`. This method differs from the other stats methods because it | ||
| provides a collection of statistics, instead of the usual one method per | ||
| stat. Otherwise, it behaves similarly to most other functions with respect to | ||
| `sample_sets` and `indexes`. Site statistics can be computed from multi-allelic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, "Two-locus statistics can be computed using two modes, either site or branch, and these should be interpreted in the same was as single-site statistics. Site statistics allow for multi-allelic data, while branch statistics assume an infinite sites model."
|
|
||
| #### Site | ||
|
|
||
| The `site` mode computes two-locus statistics from pairs of alleles on sites. By |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"from pairs of alleles on sites" - it's not entirely clear what this means. Maybe, "summarized over all pairs of alleles between specified sites."?
|
|
||
| The `site` mode computes two-locus statistics from pairs of alleles on sites. By | ||
| default, this method will compute a matrix for all pairs of sites, with rows and | ||
| columns representing each site in the tree sequence (ie an n×n matrix where n is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"i.e.,".
Usually use "n" for sample size, maybe "m" for number of sites.
| In the site mode, the sites under consideration can be restricted using the | ||
| ``sites`` argument. Sites can be passed as a list of lists, specifying the | ||
| ``[[row_sites], [col_sites]]`` or by specifying ``[all_sites]``, where a square |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though note that it doesn't have to be "all sites". It can be a single subset of sites on the tree, in which case we get a square matrix.
| We can also compute two-way LD statistics between two sample sets. If the | ||
| ``indexes`` argument is specified, at least two sample sets must also be | ||
| specified. ``indexes`` specifies the sample sets indexes between which to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"sample set indexes" ... or "sample sets' indexes"?
| :math:`\widehat{\pi_2}` n n "pi2_unbiased" | ||
| ========================= ============ ================= ================= | ||
| :param list sample_sets: A list of lists of Node IDs, specifying the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of sample node IDs, maybe?
| mode. [[row_sites], [col_sites]] or [all_sites]. | ||
| :param list positions: A list of genomic positions to restrict. Can be | ||
| specified as a list of lists to control the row and column sites. | ||
| [[row_sites], [col_sites]] or [all_sites]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only applicable in branch mode?
| specified as a list of lists to control the row and column sites. | ||
| [[row_sites], [col_sites]] or [all_sites]. | ||
| :param list indexes: A list of 2-tuples or a single 2-tuple, specifying | ||
| the indexes of two populations on which to compute two-way LD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of two sample sets over which
This PR provides some documentation for the ld_matrix method so that we can consider it a finalized, public API method.
I've tried to provide a sufficient level of detail without going too deep into the method. I also provide some simple demos with the provided example tree sequences. I attempt to provide everything a user would need in the docstring for the
ld_matrixmethod, I'm not sure if it's too much detail.I didn't touch the existing
Todonote in theMulti site statisticssection of the documentation, I wasn't exactly sure how we wanted to handle that. To my knowledge, theLdCalculatorstill exists and it would be useful to frame this method in context with that one.cc: @petrelharp @apragsdale