Draft of article based on discussions about TCP Info data and caveats analyzing it by jduckles · Pull Request #9 · m-lab/knowledgebase

jduckles · 2026-07-01T01:13:30Z

Hey @sermpezis and @robertodauria, could you please review and edit this as you see fit? I pulled it together from all the discussion, document, and Slack context using the new KB article Claude skill in this repo, under `.claude/skills/mlab-kb-article`.

robertodauria

Thanks! I've added some comments — see below.

robertodauria · 2026-07-02T15:20:08Z

+  ON  ndt7.id   = tcp.id
+  AND ndt7.date = tcp.date
+WHERE
+    DATE(ndt7.a.TestTime) = '2026-06-01'


This isn't actually filtering over the partition column (which is date), which means the query cannot run:

Cannot query over table 'mlab-oti.ndt.ndt7' without a filter over column(s) 'date' that can be used for partition elimination

robertodauria · 2026-07-02T15:21:16Z

+  ON  ndt7.id   = tcp.id
+  AND ndt7.date = tcp.date
+WHERE
+    DATE(ndt7.a.TestTime) = '2026-06-01'


Suggested change

-    DATE(ndt7.a.TestTime) = '2026-06-01'
+    ndt7.date = '2026-06-01'

robertodauria · 2026-07-02T15:22:28Z

+  ON  ndt7.id   = tcp.id
+  AND ndt7.date = tcp.date
+WHERE
+    DATE(ndt7.a.TestTime) = "2026-06-01"


Suggested change

-    DATE(ndt7.a.TestTime) = "2026-06-01"
+    ndt7.date = "2026-06-01"

robertodauria · 2026-07-02T15:22:43Z

+  ON  ndt7.id   = tcp.id
+  AND ndt7.date = tcp.date
+WHERE
+    DATE(ndt7.a.TestTime) = "2026-06-01"


This isn't actually filtering over the partition column (which is date), which means the query cannot run:

Cannot query over table 'mlab-oti.ndt.ndt7' without a filter over column(s) 'date' that can be used for partition elimination

robertodauria · 2026-07-02T15:24:14Z

+
+<!-- TODO: Add direct link to Pavlos' TCPinfo Colab notebook once it has a stable public URL. -->
+<!-- TODO: Add section on unnesting the raw.Snapshots array in BigQuery for within-connection time series analysis. -->
+<!-- FIXME: Verify that the RTT/RTTVar fields cited above match the current ndt.tcpinfo schema exactly — column paths may differ between the ndt.tcpinfo view and raw tables. -->


I would expect the verification to happen before the KB article is posted. Could you please confirm that the TCPInfo schema matches?

robertodauria · 2026-07-02T15:25:51Z

+
+Files are stored in `.zst`-compressed JSONL format. Pavlos Sermpezis has a [Colab notebook](https://colab.research.google.com/) for snapshot-level analysis — ask on the M-Lab Discuss list or Slack for the current link.
+
+<!-- TODO: Add direct link to Pavlos' TCPinfo Colab notebook once it has a stable public URL. -->


TODOs in code comments aren't very visible — I'd rather wait until we have a public link to add here (if posting this isn't urgent), or create an issue/a CU task to document what is missing before merging this PR, perhaps assigning the person this is blocked on.

Also, AFAIK M-Lab's Slack isn't exactly "public" the same way the Discuss list is; it's invitation-only.

robertodauria · 2026-07-02T15:26:45Z

+Files are stored in `.zst`-compressed JSONL format. Pavlos Sermpezis has a [Colab notebook](https://colab.research.google.com/) for snapshot-level analysis — ask on the M-Lab Discuss list or Slack for the current link.
+
+<!-- TODO: Add direct link to Pavlos' TCPinfo Colab notebook once it has a stable public URL. -->
+<!-- TODO: Add section on unnesting the raw.Snapshots array in BigQuery for within-connection time series analysis. -->


Same: either add the section as part of this PR, or create an issue instead of a TODO in a comment.

(this applies to every other TODO in this file)

robertodauria · 2026-07-02T15:32:37Z

+ORDER BY num_snapshots
+```
+
+Comparing the two outputs makes the noise problem concrete: the first query will show a large fraction of 1–2 snapshot rows; the second (UUID-joined) query will show a clean distribution concentrated at 40–100 snapshots.


Since we're inviting a comparison here, I think it would be helpful if the two queries used the same date.

They also LIMIT 10000 in the inner query with no ORDER BY, which I believe makes the output non-deterministic. They then use this sample to compute a percentage, which would be non-deterministic as well.
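To make the sampling concern concrete, here is a minimal Python sketch (not BigQuery SQL, and not part of the article under review) of one way to get a reproducible sample: pick connections by a hash of their id instead of taking whichever rows happen to come back first. The function names, the default sampling rate, and the example ids are hypothetical; the "1–2 snapshots" threshold comes from the article text quoted above.

```python
import hashlib


def in_sample(connection_id: str, percent: int = 10) -> bool:
    """Deterministically decide whether a connection id falls in the sample.

    The id is hashed and mapped to a bucket in [0, 100); the same id always
    lands in the same bucket, regardless of row order or when the code runs.
    """
    digest = hashlib.sha256(connection_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def short_connection_share(
    snapshot_counts: dict[str, int],
    percent: int = 10,
    max_snapshots: int = 2,
) -> float:
    """Fraction of sampled connections with at most `max_snapshots` snapshots."""
    sampled = [
        count
        for connection_id, count in snapshot_counts.items()
        if in_sample(connection_id, percent)
    ]
    if not sampled:
        return 0.0
    short = sum(1 for count in sampled if count <= max_snapshots)
    return short / len(sampled)


if __name__ == "__main__":
    # Hypothetical ids and snapshot counts, for illustration only.
    counts = {"conn-a": 1, "conn-b": 2, "conn-c": 50, "conn-d": 80}
    print(short_connection_share(counts, percent=100))
```

Because membership depends only on the id, the resulting percentage is the same on every run. In BigQuery the analogous approach would be to filter on a hash of the id column (for example with `FARM_FINGERPRINT`) rather than rely on `LIMIT` without `ORDER BY`.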

robertodauria · 2026-07-02T15:34:23Z

+
+## Raw Data on GCS
+
+For analyses requiring the full 10 ms resolution, the complete unthinnned snapshot archives are available in Google Cloud Storage:


Suggested change

-For analyses requiring the full 10 ms resolution, the complete unthinnned snapshot archives are available in Google Cloud Storage:
+For analyses requiring the full 10 ms resolution, the complete unthinned snapshot archives are available in Google Cloud Storage:

robertodauria · 2026-07-02T15:38:59Z

+gs://archive-measurement-lab/ndt/tcpinfo/YYYY/MM/DD/
+```
+
+Files are stored in `.zst`-compressed JSONL format. Pavlos Sermpezis has a [Colab notebook](https://colab.research.google.com/) for snapshot-level analysis — ask on the M-Lab Discuss list or Slack for the current link.


> Files are stored in .zst-compressed JSONL format

This is correct but omits the tarball layer: users will find .tgz archives containing per-connection .jsonl.zst files.
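As an illustration of the extra layer, a minimal Python sketch of the first step a reader would need: opening a downloaded `.tgz` archive and listing the per-connection `.jsonl.zst` members inside it. The function name is hypothetical, and the sketch stops at listing members; decompressing each `.jsonl.zst` member needs a Zstandard library, which is left out here.

```python
import io
import tarfile


def list_snapshot_members(tgz_bytes: bytes) -> list[str]:
    """Return the names of .jsonl.zst files contained in a gzip-compressed tarball."""
    with tarfile.open(fileobj=io.BytesIO(tgz_bytes), mode="r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".jsonl.zst")
        )
```

Each returned name can then be passed to `TarFile.extractfile` and fed to a Zstandard decoder to get the JSONL lines for one connection.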

Commit f277076: Draft of article based on discussions about TCP Info data and caveats about analyzing it

jduckles requested review from robertodauria and sermpezis July 1, 2026 01:14

jduckles self-assigned this Jul 1, 2026

jduckles added the documentation label Jul 1, 2026

robertodauria requested changes Jul 2, 2026

jduckles wants to merge 1 commit into main from newarticle/tcpinfo-snapshot-analysis
