Skip to content

Conversation

@lkirk
Copy link
Contributor

@lkirk lkirk commented Dec 16, 2025

This PR provides some documentation for the ld_matrix method so that we can consider it a finalized, public API method.

I've tried to provide a sufficient level of detail without going too deep into the method. I also provide some simple demos with the provided example tree sequences. I attempt to provide everything a user would need in the docstring for the ld_matrix method, I'm not sure if it's too much detail.

I didn't touch the existing Todo note in the Multi site statistics section of the documentation, I wasn't exactly sure how we wanted to handle that. To my knowledge, the LdCalculator still exists and it would be useful to frame this method in context with that one.

cc: @petrelharp @apragsdale

@codecov
Copy link

codecov bot commented Dec 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.76%. Comparing base (25a26a7) to head (c1fc5c7).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3353   +/-   ##
=======================================
  Coverage   89.76%   89.76%           
=======================================
  Files          29       29           
  Lines       31292    31292           
  Branches     5738     5738           
=======================================
  Hits        28089    28089           
  Misses       1794     1794           
  Partials     1409     1409           
Flag Coverage Δ
c-tests 86.77% <ø> (ø)
lwt-tests 80.38% <ø> (ø)
python-c-tests 87.12% <ø> (ø)
python-tests 98.85% <ø> (ø)
python-tests-no-jit 33.51% <ø> (ø)
python-tests-numpy1 50.28% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
python/tskit/trees.py 98.89% <ø> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

subset of sites, either by specifying a single vector for both rows and columns
or a pair of vectors for the row sites and column sites separately.

The following example produces a matrix containing {math}`r^2` computed pairwise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The following example produces a matrix containing {math}`r^2` computed pairwise
The following computes a matrix of {math}`r^2` measure of linkage-disequilibrium (LD) pairwise

The two-locus API provides a mechanism by which to subset the samples under
consideration, providing the ability to compute a separate LD matrix for each
sample set or an LD matrix between sample sets. Output dimensions are handled in
the same manner as the rest of the stats api (see
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the same manner as the rest of the stats api (see
the same manner as the rest of the stats API (see

: {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{ab} - p_{a}p_{b}`

This statistic is inherently polarised, as the unpolarised result of this
statistic is expected to be zero. Uses the `total` normalisation method.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expectation is zero when unpolarised, but actual values can differ from zero, particularly in small populations or under intense selection, no?

:param list sample_sets: A list of lists of Node IDs, specifying the
groups of nodes to compute the statistic with.
Defaults to all samples grouped by population. TODO: does it?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lingering TODO

@gregorgorjanc
Copy link
Member

gregorgorjanc commented Dec 16, 2025

I spotted minor bits for you to check/consider. Someone more versed in the tskit code should also have a look;)

@apragsdale
Copy link

Thank you @lkirk! I’ll give detailed comments on it later this evening. Great to see your methods getting documentation

Copy link

@apragsdale apragsdale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Lloyd - many small, minor comments here that will hopefully help with documentation clarity in places. Thanks again!

The {meth}`~TreeSequence.ld_matrix` method provides an interface to a collection
of two-locus statistics with predefined summary functions (see
{ref}`sec_stats_two_locus_summary_functions`) and `site` and `branch` {ref}`modes
<sec_stats_mode>`. This method differs from the other stats methods because it

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The API for this method differs..."

of two-locus statistics with predefined summary functions (see
{ref}`sec_stats_two_locus_summary_functions`) and `site` and `branch` {ref}`modes
<sec_stats_mode>`. This method differs from the other stats methods because it
provides a collection of statistics, instead of the usual one method per

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, "The LD matrix method differs from other statistics methods in that it provides a unified API with an argument to specify different two-locus summaries of the data." Or something along those lines.

<sec_stats_mode>`. This method differs from the other stats methods because it
provides a collection of statistics, instead of the usual one method per
stat. Otherwise, it behaves similarly to most other functions with respect to
`sample_sets` and `indexes`. Site statistics can be computed from multi-allelic

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, "Two-locus statistics can be computed using two modes, either site or branch, and these should be interpreted in the same was as single-site statistics. Site statistics allow for multi-allelic data, while branch statistics assume an infinite sites model."


#### Site

The `site` mode computes two-locus statistics from pairs of alleles on sites. By

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"from pairs of alleles on sites" - it's not entirely clear what this means. Maybe, "summarized over all pairs of alleles between specified sites."?


The `site` mode computes two-locus statistics from pairs of alleles on sites. By
default, this method will compute a matrix for all pairs of sites, with rows and
columns representing each site in the tree sequence (ie an n×n matrix where n is

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"i.e.,".

Usually use "n" for sample size, maybe "m" for number of sites.

In the site mode, the sites under consideration can be restricted using the
``sites`` argument. Sites can be passed as a list of lists, specifying the
``[[row_sites], [col_sites]]`` or by specifying ``[all_sites]``, where a square

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though note that it doesn't have to be "all sites". It can be a single subset of sites on the tree, in which case we get a square matrix.

We can also compute two-way LD statistics between two sample sets. If the
``indexes`` argument is specified, at least two sample sets must also be
specified. ``indexes`` specifies the sample sets indexes between which to

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"sample set indexes" ... or "sample sets' indexes"?

:math:`\widehat{\pi_2}` n n "pi2_unbiased"
========================= ============ ================= =================
:param list sample_sets: A list of lists of Node IDs, specifying the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of sample node IDs, maybe?

mode. [[row_sites], [col_sites]] or [all_sites].
:param list positions: A list of genomic positions to restrict. Can be
specified as a list of lists to control the row and column sites.
[[row_sites], [col_sites]] or [all_sites].

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only applicable in branch mode?

specified as a list of lists to control the row and column sites.
[[row_sites], [col_sites]] or [all_sites].
:param list indexes: A list of 2-tuples or a single 2-tuple, specifying
the indexes of two populations on which to compute two-way LD

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of two sample sets over which

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants