Skip to content

read_nsrdb_psm4: parse header with the csv module to keep quoted commas#2771

Merged
kandersolar merged 1 commit into
pvlib:mainfrom
gaoflow:fix-2736-nsrdb-psm4-quoted-columns
Jun 9, 2026
Merged

read_nsrdb_psm4: parse header with the csv module to keep quoted commas#2771
kandersolar merged 1 commit into
pvlib:mainfrom
gaoflow:fix-2736-nsrdb-psm4-quoted-columns

Conversation

@gaoflow

@gaoflow gaoflow commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

What this fixes

read_nsrdb_psm4 parsed its three header lines with a naive str.split(','):

metadata_fields = fbuf.readline().split(',')
metadata_values = fbuf.readline().split(',')
columns        = fbuf.readline().split(',')

The NSRDB spectral-on-demand CSVs reported in #2736 have quoted column
names that contain commas, e.g.

..., "GaAs (Bauhuis et al., 2009)","InGaP (Gray, 2008)", ...

These are valid CSV (the commas are inside quotes), and pandas.read_csv
parses the data rows correctly — but str.split(',') splits each quoted name
into multiple fragments, inflating the column count. The mismatch between the
mis-split names/usecols and the correctly-parsed data then raises on read.

The change

Parse the three header lines with the csv module (which honors quoting)
instead of str.split(','). For ordinary (unquoted) files this is identical
to the previous behavior, so the existing readers are unaffected.

This addresses the parsing crash that @kandersolar confirmed should be
supported. The further map_variables=True unit handling for spectral files
(W/m²/µm → W/m²/nm) mentioned in the issue is a separate enhancement and is
left out of scope here.

Reproduction (before this PR)

from io import StringIO
from pvlib.iotools import psm4

content = (
    "Source,Location ID,City,State,Country,Latitude,Longitude,Time Zone,"
    "Elevation,Local Time Zone,Version\n"
    "NSRDB,1,-,-,-,40.0,-105.0,-7,1600,-7,4.0.1\n"
    'Year,Month,Day,Hour,Minute,GHI,"GaAs (Bauhuis et al., 2009)",'
    '"InGaP (Gray, 2008)"\n'
    "2023,1,1,0,0,0,0.1,0.2\n"
    "2023,1,1,1,0,5,0.3,0.4\n"
)
psm4.read_nsrdb_psm4(StringIO(content), map_variables=False)
# ParserError: Too many columns specified: expected 10 and found 8

After the fix the quoted columns survive intact
('GaAs (Bauhuis et al., 2009)', 'InGaP (Gray, 2008)').

A regression test (test_read_nsrdb_psm4_quoted_columns_with_commas) is added
that fails on main and passes with this change; the existing
read_nsrdb_psm4 tests continue to pass.

read_nsrdb_psm4 split the three header lines with a naive str.split(','),
which broke spectral-on-demand files whose column names are quoted fields
containing commas (e.g. '"GaAs (Bauhuis et al., 2009)"'). Such names were
split into spurious columns, raising on read. Parse the header lines with
the csv module so quoted fields are kept intact.

Fixes pvlib#2736
@gaoflow gaoflow force-pushed the fix-2736-nsrdb-psm4-quoted-columns branch from e79f87e to 88f18aa Compare June 8, 2026 13:43
@kandersolar kandersolar added this to the v0.15.2 milestone Jun 9, 2026
@kandersolar kandersolar added the remote-data triggers --remote-data pytests label Jun 9, 2026

@kandersolar kandersolar left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR, thanks @gaoflow

@kandersolar kandersolar changed the title Parse read_nsrdb_psm4 header with the csv module to keep quoted commas (fixes #2736) read_nsrdb_psm4: parse header with the csv module to keep quoted commas Jun 9, 2026
@kandersolar kandersolar merged commit 044ad59 into pvlib:main Jun 9, 2026
37 of 39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug io remote-data triggers --remote-data pytests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Issue with importing NSRDB spectral-on-demand files with pvlib.iotools.read_nsrdb_psm4

2 participants