Draft of article based on discussions about TCP Info data and caveats analyzing it #9
jduckles wants to merge 1 commit into
Conversation
… about analyzing it
robertodauria left a comment
Thanks! I've added some comments — see below.
| ON ndt7.id = tcp.id | ||
| AND ndt7.date = tcp.date | ||
| WHERE | ||
| DATE(ndt7.a.TestTime) = '2026-06-01' |
This doesn't filter on the partition column (`date`), so the query cannot run:
Cannot query over table 'mlab-oti.ndt.ndt7' without a filter over column(s) 'date' that can be used for partition elimination
| ON ndt7.id = tcp.id | ||
| AND ndt7.date = tcp.date | ||
| WHERE | ||
| DATE(ndt7.a.TestTime) = '2026-06-01' |
Suggested change:
- DATE(ndt7.a.TestTime) = '2026-06-01'
+ ndt7.date = '2026-06-01'
| ON ndt7.id = tcp.id | ||
| AND ndt7.date = tcp.date | ||
| WHERE | ||
| DATE(ndt7.a.TestTime) = "2026-06-01" |
Suggested change:
- DATE(ndt7.a.TestTime) = "2026-06-01"
+ ndt7.date = "2026-06-01"
| ON ndt7.id = tcp.id | ||
| AND ndt7.date = tcp.date | ||
| WHERE | ||
| DATE(ndt7.a.TestTime) = "2026-06-01" |
This doesn't filter on the partition column (`date`), so the query cannot run:
Cannot query over table 'mlab-oti.ndt.ndt7' without a filter over column(s) 'date' that can be used for partition elimination
| <!-- TODO: Add direct link to Pavlos' TCPinfo Colab notebook once it has a stable public URL. --> | ||
| <!-- TODO: Add section on unnesting the raw.Snapshots array in BigQuery for within-connection time series analysis. --> | ||
| <!-- FIXME: Verify that the RTT/RTTVar fields cited above match the current ndt.tcpinfo schema exactly — column paths may differ between the ndt.tcpinfo view and raw tables. --> |
I would expect the verification to happen before the KB article is posted. Could you please confirm that the TCPInfo schema matches?
| Files are stored in `.zst`-compressed JSONL format. Pavlos Sermpezis has a [Colab notebook](https://colab.research.google.com/) for snapshot-level analysis — ask on the M-Lab Discuss list or Slack for the current link. | ||
| <!-- TODO: Add direct link to Pavlos' TCPinfo Colab notebook once it has a stable public URL. --> |
TODOs in code comments aren't very visible — I'd rather wait until we have a public link to add here (if posting this isn't urgent), or create an issue/a CU task to document what is missing before merging this PR, perhaps assigning the person this is blocked on.
Also, AFAIK M-Lab's Slack isn't "public" the way the Discuss list is; it's invitation-only.
| Files are stored in `.zst`-compressed JSONL format. Pavlos Sermpezis has a [Colab notebook](https://colab.research.google.com/) for snapshot-level analysis — ask on the M-Lab Discuss list or Slack for the current link. | ||
| <!-- TODO: Add direct link to Pavlos' TCPinfo Colab notebook once it has a stable public URL. --> | ||
| <!-- TODO: Add section on unnesting the raw.Snapshots array in BigQuery for within-connection time series analysis. --> |
Same: either add the section as part of this PR, or create an issue instead of a TODO in a comment.
(this applies to every other TODO in this file)
| ORDER BY num_snapshots | ||
| ``` | ||
| Comparing the two outputs makes the noise problem concrete: the first query will show a large fraction of 1–2 snapshot rows; the second (UUID-joined) query will show a clean distribution concentrated at 40–100 snapshots. |
Since we're inviting a comparison here, I think it would be helpful if the two queries used the same date.
They also LIMIT 10000 in the inner query with no ORDER BY, which I believe makes the output non-deterministic. They then use this sample to compute a percentage, which would be non-deterministic as well.
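One way to make the sample reproducible, sketched here in Python rather than SQL: select rows by hashing their id instead of relying on LIMIT without ORDER BY. The field names `id` and `num_snapshots`, the 10% sample size, and the "at most 2 snapshots" threshold are assumptions for illustration, not the article's actual query. In BigQuery, a hash function such as FARM_FINGERPRINT could play the same role.

```python
import hashlib


def in_sample(row_id: str, percent: int) -> bool:
    """Return True if row_id falls into a stable `percent`% sample."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 100).
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def short_connection_share(rows, percent=10, max_snapshots=2):
    """Share of sampled rows that have at most `max_snapshots` snapshots.

    `rows` is an iterable of dicts with hypothetical keys
    "id" and "num_snapshots".
    """
    sampled = [r for r in rows if in_sample(r["id"], percent)]
    if not sampled:
        return 0.0
    short = sum(1 for r in sampled if r["num_snapshots"] <= max_snapshots)
    return short / len(sampled)
```

Because sample membership depends only on the id, the computed share is the same on every run and for any row order.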
| ## Raw Data on GCS | ||
| For analyses requiring the full 10 ms resolution, the complete unthinnned snapshot archives are available in Google Cloud Storage: |
Suggested change:
- For analyses requiring the full 10 ms resolution, the complete unthinnned snapshot archives are available in Google Cloud Storage:
+ For analyses requiring the full 10 ms resolution, the complete unthinned snapshot archives are available in Google Cloud Storage:
| gs://archive-measurement-lab/ndt/tcpinfo/YYYY/MM/DD/ | ||
| ``` | ||
| Files are stored in `.zst`-compressed JSONL format. Pavlos Sermpezis has a [Colab notebook](https://colab.research.google.com/) for snapshot-level analysis — ask on the M-Lab Discuss list or Slack for the current link. |
> Files are stored in `.zst`-compressed JSONL format

This is correct but omits the tarball layer: users will find `.tgz` archives containing per-connection `.jsonl.zst` files.
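To illustrate the two layers, here is a minimal Python sketch that walks a locally downloaded `.tgz` archive and decodes each per-connection `.jsonl.zst` member line by line. It assumes the third-party `zstandard` package is installed; the function names are hypothetical, and nothing is assumed about the fields inside each record.

```python
import io
import json
import tarfile


def zstd_open(fileobj):
    """Wrap a compressed binary stream in a decompressing reader.

    Assumption: the third-party `zstandard` package is available
    (pip install zstandard).
    """
    import zstandard
    return zstandard.ZstdDecompressor().stream_reader(fileobj)


def iter_snapshots(tgz_path, suffix=".jsonl.zst", opener=zstd_open):
    """Yield (member_name, record) for each JSON line in matching members."""
    with tarfile.open(tgz_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and files that are not per-connection logs.
            if not (member.isfile() and member.name.endswith(suffix)):
                continue
            raw = archive.extractfile(member)
            text = io.TextIOWrapper(opener(raw), encoding="utf-8")
            for line in text:
                if line.strip():
                    yield member.name, json.loads(line)
```

Looping over `iter_snapshots(path_to_archive)` then gives one decoded JSON object per line, tagged with the name of the member it came from. The `opener` parameter is only there so a different decompressor can be swapped in.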
Hey @sermpezis and @robertodauria, could you please review and edit this as you see fit? I pulled it together from the discussions, documents, and Slack context using the new KB article Claude skill in this repo, located in `.claude/skills/mlab-kb-article`.