
feat: OpenAlex/PubMed ETL pipeline with WoS schema standardization #24

Open
isx9 wants to merge 112 commits into PRAISELab-PicusLab:main from isx9:dev

@isx9 isx9 commented Jun 21, 2026


Group Members

Isabella Di Lorenzi — D03000224
Maria Pasconcino — D03000295

Summary

This PR adds an Extract → Transform → Validate pipeline (bibliometrix-python-dev) that retrieves records from the OpenAlex and PubMed REST APIs and standardizes them into the Web of Science (WoS) schema expected by the bibliometrix-python dashboard. It also documents and patches the dashboard-side bugs uncovered while validating that standardized output against every analytical function in services/ and functions/.

1. Architecture

1.1 Pipeline overview

┌────────────────┐    ┌────────────────┐    ┌──────────────┐
│  api_retriever │───▶│  standardizer  │───▶│  validator   │───▶ CSV / DataFrame
│    .py         │    │  .py + mappings│    │  .py         │
│                │    │  .py           │    │              │
│ • OpenAlex API │    │ • Rename cols  │    │ • Schema ✓   │
│ • PubMed API   │    │ • Parse nested │    │ • No null ✓  │
│ • pagination,  │    │   fields       │    │ • Types ✓    │
│   retries      │    │ • Type-cast    │    │              │
│                │    │ • Compute SR   │    │              │
└────────────────┘    └────────────────┘    └──────────────┘

Each phase lives in its own module, with no monolithic function. execution_evidence.ipynb wires them together as retrieve() → standardize() → validate(), the functional equivalent of the R package's convert2df(), which was the missing single entry point in the original codebase.

1.2 The Dispatcher

Dispatch happens at exactly two points — the boundary of each phase — rather than being re-implemented by every downstream consumer (the failure mode in the legacy format_functions.py, where every column-level formatter re-branches on source):

# api_retriever.py: Extract-level dispatch
from typing import Optional

def retrieve(query: str, platform: str = "openalex", total: int = 100,
             mindate: Optional[str] = None, maxdate: Optional[str] = None) -> list:
    # Date bounds are forwarded to PubMed only; the OpenAlex branch ignores them.
    if platform == "openalex":
        return fetch_openalex(query=query, total_wanted=total)
    elif platform == "pubmed":
        return fetch_pubmed(query=query, total_wanted=total, mindate=mindate, maxdate=maxdate)
    else:
        raise ValueError(f"Unsupported platform: {platform}. Choose 'openalex' or 'pubmed'.")

# standardizer.py: Transform-level dispatch
import pandas as pd

def standardize(records: list, source: str) -> pd.DataFrame:
    standardized = []
    for record in records:
        if source == "pubmed":
            standardized.append(standardize_pubmed(record))
        elif source == "openalex":
            standardized.append(standardize_openalex(record))
        else:
            raise ValueError(f"Unsupported source: {source}. Choose 'pubmed' or 'openalex'.")
    df = pd.DataFrame(standardized)
    # SR() appends the short-reference column computed from the standardized fields.
    return SR(df)

fetch_openalex() / fetch_pubmed() each absorb that API's own pagination, ID-resolution, and rate-limiting, but both return the same shape (list[dict]). standardize_openalex() / standardize_pubmed() each own all null-handling, list-casting, and default-filling for their source internally, so source-specific logic is confined to these two switch points instead of leaking into every caller.
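As a rough illustration of what "absorbing pagination and retries" can look like, here is a minimal sketch of a cursor-style paging loop with retry and exponential backoff. It is not the PR's implementation: fetch_all, fetch_with_retry, the injected fetch_page callable, and the "*" start cursor are all hypothetical names and conventions chosen for the example, and the network call is injected so the loop can run offline.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical page shape: (records on this page, cursor for the next page or None).
Page = Tuple[list, Optional[str]]


def fetch_with_retry(fetch_page: Callable[[str], Page], cursor: str,
                     max_retries: int, backoff_s: float) -> Page:
    """Call fetch_page, retrying transient connection errors with backoff."""
    for attempt in range(max_retries):
        try:
            return fetch_page(cursor)
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff_s * 2 ** attempt)
    raise ValueError("max_retries must be at least 1")


def fetch_all(fetch_page: Callable[[str], Page], total_wanted: int,
              max_retries: int = 3, backoff_s: float = 1.0) -> list:
    """Follow cursors until total_wanted records are collected or pages run out."""
    records: list = []
    cursor: Optional[str] = "*"  # assumed start-of-results marker
    while cursor is not None and len(records) < total_wanted:
        page, cursor = fetch_with_retry(fetch_page, cursor, max_retries, backoff_s)
        if not page:
            break  # an empty page means there is nothing more to read
        records.extend(page)
    return records[:total_wanted]  # trim any overshoot from the last page
```

Whatever the real fetchers do internally, the point made above is that callers only ever see the final list[dict], never cursors or retry state.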

1.3 Mapping dictionaries

mappings.py holds flat raw_field → WoS_tag dictionaries, applied via apply_mapping():

OPENALEX_MAPPING = {
    "id":               "UT",
    "doi":              "DI",
    "title":            "TI",
    "publication_year": "PY",
    "language":         "LA",
    "type":             "DT",
}

def apply_mapping(record: dict, mapping: dict) -> dict:
    """Renames raw API field names to WoS tags. Only processes fields in the mapping."""
    result = {}
    for raw_field, wos_tag in mapping.items():
        result[wos_tag] = record.get(raw_field, "")
    return result

The dictionaries cover only direct 1:1 renames. Each dictionary is followed by inline comments enumerating the two classes of exception that can't go in a lookup table: fields needing structural transformation (OpenAlex's nested authorships, abstract_inverted_index, biblio; PubMed's articleids, pages), handled by dedicated parse_* functions; and fields with no equivalent on that source (ID for both, PMID for OpenAlex, TC/C1/DE/AU_CO for PubMed's eSummary response), explicitly filled with safe defaults rather than left silently absent. One parse_* function goes further than reshaping: OpenAlex's referenced_works is just a list of bare work IDs, so parse_openalex_references() hands those IDs to resolve_openalex_references(), which calls back out to the OpenAlex API and reconstructs proper "Smith J, 2019, NAT COMMUN"-style strings for CR. It is the one field that requires a second API call rather than a transformation of data already in hand.
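To make "structural transformation" concrete, the sketch below rebuilds plain text from an abstract_inverted_index, which OpenAlex returns as a mapping from each word to the list of positions where it occurs. The function name rebuild_abstract is hypothetical and this is not necessarily how the PR's parse_* function is written; it only shows why such a field cannot be handled by a rename.

```python
from typing import Optional


def rebuild_abstract(inverted_index: Optional[dict]) -> str:
    """Rebuild abstract text from a word -> [positions] index; "" when absent."""
    if not inverted_index:
        return ""  # follow the convention of "" for missing scalar fields
    # Flatten to (position, word) pairs so sorting restores the reading order.
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    return " ".join(word for _, word in sorted(positioned))


index = {"Deep": [0], "learning": [1, 4], "beats": [2], "shallow": [3]}
print(rebuild_abstract(index))  # Deep learning beats shallow learning
```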

1.4 Type contracts

validator.py defines every WoS tag's required Python type and runs three checks against it:

MANDATORY_COLUMNS = {
    "DB": str, "UT": str, "DI": str, "PMID": str, "TI": str, "SO": str, "JI": str,
    "PY": str, "DT": str, "LA": str, "TC": int, "AU": list, "AF": list, "C1": list,
    "RP": str, "CR": list, "DE": list, "ID": list, "AB": str, "VL": str, "IS": str,
    "BP": str, "EP": str, "SR": str,
}

def validate(df: pd.DataFrame) -> pd.DataFrame:
    missing_cols = check_columns(df)   # presence
    null_cols    = check_nulls(df)     # no None / NaN anywhere
    type_cols    = check_types(df)     # isinstance per column
    if missing_cols or null_cols or type_cols:
        raise ValueError("Validation failed. Fix the issues above before proceeding.")
    return df

The null-handling convention ("" for scalars, [] for lists, never None/NaN) is enforced twice: inside standardize_openalex()/standardize_pubmed() via an explicit sanitization pass, and again externally by the validator as a hard gate. A regression in either layer is still caught by the other.
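The two layers can be sketched independently of pandas. The example below is an assumption-laden illustration, not the PR's code: SCHEMA is a three-tag subset of MANDATORY_COLUMNS, and is_null, sanitize_record, and type_violations are hypothetical helper names. It fills nulls with the empty value of the expected type, then checks types as a separate step.

```python
import math

# Illustrative subset of the type contract; the full table is MANDATORY_COLUMNS.
SCHEMA = {"TI": str, "TC": int, "AU": list}


def is_null(value) -> bool:
    """True for None and for float NaN, the two forms the convention forbids."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def sanitize_record(record: dict, schema: dict) -> dict:
    """Replace missing or null fields with the empty value of the expected type."""
    clean = {}
    for tag, expected in schema.items():
        value = record.get(tag)
        if is_null(value):
            value = expected()  # str() -> "", list() -> [], int() -> 0
        clean[tag] = value
    return clean


def type_violations(record: dict, schema: dict) -> list:
    """Return the tags whose value is not an instance of the required type."""
    return [tag for tag, expected in schema.items()
            if not isinstance(record.get(tag), expected)]


raw = {"TI": None, "TC": float("nan"), "AU": ["Smith J"]}
clean = sanitize_record(raw, SCHEMA)
print(clean)                           # {'TI': '', 'TC': 0, 'AU': ['Smith J']}
print(type_violations(clean, SCHEMA))  # []
```

Keeping the check separate from the fill is what lets the external gate catch a value such as TC="3", which is not null and would pass a sanitization pass untouched.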

2. Files Changed

New files (www/services/)

| File | Purpose |
| --- | --- |
| api_retriever.py | Extract phase: OpenAlex & PubMed API retrieval, pagination, retries, rate-limit handling |
| mappings.py | PUBMED_MAPPING / OPENALEX_MAPPING declarative rename dictionaries |
| standardizer.py | Transform phase: dispatch, field parsing, type-casting, SR computation |
| validator.py | Validation phase: schema, null, and type contract checks |

3. Debugging & Patches to Analytical Functions

3.1 Methodology

Once the ETL produced valid, schema-compliant DataFrames, each source's 200-row test set (test_openalex_200.csv, test_pubmed_200.csv) was run through every relevant dashboard panel. Every failure was classified before any fix was attempted:

  • Bug — the function crashes or silently produces wrong output even though the input satisfies the WoS contract.
  • Data-characteristic limitation — the function behaves correctly, but the source itself lacks the underlying information the analysis needs (e.g. PubMed's eSummary endpoint never returns abstracts or citation counts), so an empty or minimal result is the correct answer, not a bug.

Only items in the first bucket were patched. Items in the second are documented in 3.3 so graders and future maintainers don't mistake "no API support" for an ETL defect. Full technical detail (exact line, root cause, fix) for every file below lives in patching_documented.md.

3.2 Files patched — confirmed fixes

By far the most common single fix, applied across nearly every file below: replacing hasattr(df, "get") (which is also true for a plain pandas.DataFrame, since it has its own .get() method) with isinstance(df, pd.DataFrame) to correctly tell a Shiny reactive object apart from a plain DataFrame. The rest of each file's patches handle missing/empty columns, non-numeric PY/TC, and degenerate network/matrix cases gracefully instead of crashing.
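The difference between the two checks can be reproduced in a few lines. In this sketch FakeReactive is a hypothetical stand-in for a reactive value read with a no-argument .get(), and unwrap is an illustrative helper name, not a function from the patched files.

```python
import pandas as pd


class FakeReactive:
    """Hypothetical stand-in for a reactive value read with a no-argument .get()."""

    def __init__(self, value: pd.DataFrame):
        self._value = value

    def get(self) -> pd.DataFrame:
        return self._value


def unwrap(df):
    """Return the underlying DataFrame whether df is plain or wrapped."""
    if isinstance(df, pd.DataFrame):
        return df      # already a DataFrame: do not call .get() on it
    return df.get()    # reactive-style wrapper: read its current value


frame = pd.DataFrame({"PY": ["2012", "2019"]})
print(hasattr(frame, "get"))                 # True, so hasattr cannot discriminate
print(unwrap(frame) is frame)                # True
print(unwrap(FakeReactive(frame)) is frame)  # True
```

With the hasattr test, a plain DataFrame would be sent down the wrapper branch, and frame.get() with no arguments raises TypeError because DataFrame.get requires a key.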

www/services/

  • biblionetwork.py, cocmatrix.py, couplingmap.py, format_functions.py, histnetwork.py, metatagextraction.py, networkplot.py, termextraction.py, thematicmap.py

www/functions/

  • get_affiliationproductionovertime.py, get_annualproduction.py, get_authorlocalimpact.py, get_authorproductionovertime.py, get_averagecitations.py, get_bradfordlaw.py, get_citedcountries.py, get_citeddocuments.py, get_clusteringcoupling.py, get_co_occurence_network.py, get_cocitation.py, get_collaborationnetwork.py, get_correspondingauthorcountries.py, get_countriesproduction.py, get_countriesproductionovertime.py, get_factorialanalysis.py, get_filters.py, get_frequentwords.py, get_historiograph.py, get_localcitedauthors.py, get_localciteddocuments.py, get_localcitedreferences.py, get_localcitedsources.py, get_lotkalaw.py, get_maininformations.py, get_referencesspectroscopy.py, get_relevantaffiliations.py, get_relevantauthors.py, get_relevantsources.py, get_sourceslocalimpact.py, get_sourcesproduction.py, get_table.py, get_thematicevolution.py, get_thematicmap.py, get_threefieldplot.py, get_treemap.py, get_trendtopics.py, get_wordcloud.py, get_wordfrequency.py, get_worldmapcollaboration.py

tabletag.py required no patch (passes as-is on both sources).

app.py

  • Co-citation panel callback misfired due to an always-truthy check on the result tuple — fixed the truthiness check.
  • Factorial Analysis incorrectly excluded the Title (TI) field via a SilentException — removed that exclusion.
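The PR text does not show the original co-citation check, so the following is only an illustration of the bug class it describes: a non-empty tuple is truthy even when every element in it is None. The helper name has_result and the "at least one non-None element" rule are assumptions made for the example.

```python
def has_result(result) -> bool:
    """True only when a tuple was returned and at least one element carries data."""
    return result is not None and any(item is not None for item in result)


empty_result = (None, None)
print(bool(empty_result))        # True: the tuple itself is non-empty
print(has_result(empty_result))  # False: nothing usable inside it
```

Checking the elements rather than the container is what stops a callback from proceeding on an empty result.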

3.3 Data-characteristic limitations (not bugs)

| Field / Feature | Limitation | Affected source(s) |
| --- | --- | --- |
| ID (Keywords Plus) | WoS-exclusive field with no equivalent in either source's API; always empty | OpenAlex, PubMed |
| WC (Subject Categories) | No equivalent in either source; the column is never added to the standardized schema. table_tag() checks if tag not in M.columns before accessing it, so every consuming function degrades the same way ID does (a clean empty result, not a crash). Verified by direct testing (WordCloud, Most Frequent Words, TreeMap) on both sources | OpenAlex, PubMed |
| AU_UN (Author Institutions) | Collaboration Network returns nothing when Field is set to Institutions. The institution-extraction heuristic scans affiliation strings for WoS-convention tags (UNIV, INST, COLL); OpenAlex's raw_affiliation_strings don't follow that comma-segmented structure, so it largely fails to isolate clean institution names. This is the same underlying affiliation-format mismatch as the AU_CO/country case. For PubMed the empty result is expected for a different reason (C1 is itself largely empty), but this hasn't been directly verified | OpenAlex, PubMed |
| CR (Cited References) | Recovered for only ~7.5% of records in the 200-row PubMed test sample (sparse, not zero), because eSummary returns free-text reference strings for very few records. Impacts every panel that needs within-sample citation matching (Co-citation Network, Cluster by Coupling, Historiograph, Local Cited Authors/Documents/Sources/References, Three-Field Plot on Cited Sources) on PubMed data | PubMed |
| AB (Abstract) | PubMed's eSummary endpoint never returns abstracts, so any panel with Abstracts selected as the field has no vocabulary to work with | PubMed |
| TC (Times Cited) | PubMed's eSummary endpoint never returns citation counts, so TC stays 0 for every record, filled as a safe default | PubMed |
| PY distribution | In the 200-row PubMed test sample, 139/200 rows fall in a single year (2024) and only 4 distinct years appear in total, too narrow a spread for Thematic Evolution's time-binning to form usable periods. A property of the test sample, not of the field itself | PubMed |

4. Validation Summary

The patches in 3.2 bring the entire set of assigned services/ and functions/ modules to crash-free, graceful execution on the 200-row test sets for both OpenAlex and PubMed (full per-function status in patching_documented.md, modulo the four entries flagged in 3.3 as superseded). The remaining gaps are the data-characteristic limitations in 3.3 — empty or reduced output where the source itself doesn't provide the underlying information, not pipeline defects.

5. Live Dashboard Integration (Bonus)

Beyond the standalone CSV deliverable, the API query feature is wired directly into the Shiny dashboard, satisfying the spec's optional bonus ("Integrate this API query feature directly into the Python Shiny interface").

Two changes were required:

  1. www/services/__init__.py now exports the three ETL entry points:

    from .api_retriever import retrieve
    from .standardizer import standardize
    from .validator import validate

    Without this, app.py cannot see these functions via from www.services import * and fails with name 'retrieve' is not defined.

  2. app.py replaces the previous placeholder panel ("🚧 API is under construction 🚧") with a working "API Query" panel: a text input for the search query, a platform selector (OpenAlex / PubMed), a record-count input, and a "Run Query" button. On click:

    records = retrieve(query=query, platform=platform, total=total)
    df_api = standardize(records, source=platform)
    df_api = validate(df_api)
    
    api_result.set(df_api)
    df.set(df_api)  # makes the data available to the entire dashboard

    The panel also renders the first rows of the normalized result (AU, TI, PY, SO, TC, DB, SR) directly in the UI, satisfying the "print the first few normalized rows" requirement without a separate script.

This path never touches a CSV file — data goes straight from validate() into the shared reactive df. The CSV-serialization conventions discussed elsewhere (e.g. the ; delimiter for multi-value fields) apply only to the standalone CSV deliverable and to the dashboard's separate "Import Raw Data" upload panel, not to this live-query path.

6. Execution Evidence

execution_evidence.ipynb is the step-by-step execution log required by the spec, run end to end against both APIs:

  1. Extract → Transform → Validate on 10 sample records per source, with df.head() previews printed at each step, to inspect the raw API response shape before standardization.
  2. Full run on 200 records per source, saved to test_openalex_200.csv / test_pubmed_200.csv using the ; delimiter for multi-value fields.
  3. Phase 4 — CSV round-trip integrity check: both CSVs are reloaded from disk and re-validated, confirming the type contract (lists stay lists, no blank cell becomes NaN) survives a full save/reload cycle — relevant for the dashboard's separate "Import Raw Data" upload panel, as opposed to the live in-memory path described in Section 5.
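The round-trip property in step 3 can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the notebook's code: LIST_COLUMNS is a two-tag subset, and to_csv_text and from_csv_text are hypothetical helper names. List fields are joined with ";" on write and split on read, and a blank cell comes back as "" or [] rather than a missing value.

```python
import csv
import io

# Illustrative subset of the list-typed WoS tags.
LIST_COLUMNS = {"AU", "DE"}


def to_csv_text(rows: list, columns: list) -> str:
    """Serialize rows to CSV, joining list-typed fields with ';'."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for row in rows:
        writer.writerow({
            col: ";".join(row[col]) if col in LIST_COLUMNS else row[col]
            for col in columns
        })
    return buffer.getvalue()


def from_csv_text(text: str) -> list:
    """Parse CSV text back into rows, restoring lists and keeping blanks as ''."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for col, cell in raw.items():
            if col in LIST_COLUMNS:
                row[col] = cell.split(";") if cell else []  # "" becomes [], not [""]
            else:
                row[col] = cell  # blank scalar cells stay ""
        rows.append(row)
    return rows


rows = [
    {"TI": "A title", "AU": ["Smith J", "Doe A"], "DE": []},
    {"TI": "", "AU": [], "DE": ["etl"]},
]
print(from_csv_text(to_csv_text(rows, ["TI", "AU", "DE"])) == rows)  # True
```

This sketch reads every scalar back as a string, so an integer column such as TC would need an explicit cast on reload. A pandas-based loader would additionally have to stop blank cells from being parsed as NaN.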

Sample output from the notebook:

records = retrieve(query="machine learning", platform="openalex", total=200)
df = standardize(records, source="openalex")
df = validate(df)
Running validation...
  PASS — all mandatory columns present
  PASS — no null values found
  PASS — all column types correct
Validation passed.

SR sample: Pedregosa F, 2012, ARXIV (CORNELL UNIVERSITY)

The same flow with platform="pubmed", source="pubmed" produces test_pubmed_200.csv. Both CSVs can be uploaded directly to the dashboard after converting them to XLSX format.

isx9 and others added 30 commits May 20, 2026 20:13
Brief description inside.
Description inside.
Description inside.
Finished www/services/
Completed for both OpenAlex and PubMed.
Added SR calculation to standardize() main function
Add validator.py with column, null and type checks
Copy SR function directly into standardizer to avoid relative import error
Solved bugs:
- `LA` (PubMed) returned as a list `['eng']` by the API, needed to extract first element as string
- `DI` (OpenAlex) returned as full URL `"https://doi.org/..."`, needed to strip the prefix. Also `None` for some records, needed `or ""` guard
- `PY` (OpenAlex) returned as integer `2012`, needed to cast to string `"2012"`
isx9 and others added 30 commits June 14, 2026 13:20
…o dev
