feat: OpenAlex/PubMed ETL pipeline with WoS schema standardization by isx9 · Pull Request #24 · PRAISELab-PicusLab/bibliometrix-python

isx9 · 2026-06-21T15:31:37Z

Group Members

Isabella Di Lorenzi — D03000224
Maria Pasconcino — D03000295

Summary

This PR adds an Extract → Transform → Validate pipeline (bibliometrix-python-dev) that retrieves records from the OpenAlex and PubMed REST APIs and standardizes them into the Web of Science (WoS) schema expected by the bibliometrix-python dashboard. It also documents and patches the dashboard-side bugs uncovered while validating that standardized output against every analytical function in services/ and functions/.

1. Architecture

1.1 Pipeline overview

┌────────────────┐    ┌────────────────┐    ┌──────────────┐
│  api_retriever │───▶│  standardizer  │───▶│  validator   │───▶ CSV / DataFrame
│    .py         │    │  .py + mappings│    │  .py         │
│                │    │  .py           │    │              │
│ • OpenAlex API │    │ • Rename cols  │    │ • Schema ✓   │
│ • PubMed API   │    │ • Parse nested │    │ • No null ✓  │
│ • pagination,  │    │   fields       │    │ • Types ✓    │
│   retries      │    │ • Type-cast    │    │              │
│                │    │ • Compute SR   │    │              │
└────────────────┘    └────────────────┘    └──────────────┘

Each phase lives in its own module, with no monolithic function. execution_evidence.ipynb wires them together as retrieve() → standardize() → validate(), the functional equivalent of the R package's convert2df(), which was the missing single entry point in the original codebase.

1.2 The Dispatcher

Dispatch happens at exactly two points — the boundary of each phase — rather than being re-implemented by every downstream consumer (the failure mode in the legacy format_functions.py, where every column-level formatter re-branches on source):

# api_retriever.py: Extract-level dispatch
from typing import Optional

def retrieve(query: str, platform: str = "openalex", total: int = 100,
             mindate: Optional[str] = None, maxdate: Optional[str] = None) -> list:
    """Fetch raw records from the chosen platform; both branches return list[dict]."""
    if platform == "openalex":
        # mindate/maxdate are not forwarded to OpenAlex in this excerpt
        return fetch_openalex(query=query, total_wanted=total)
    elif platform == "pubmed":
        return fetch_pubmed(query=query, total_wanted=total, mindate=mindate, maxdate=maxdate)
    else:
        raise ValueError(f"Unsupported platform: {platform}. Choose 'openalex' or 'pubmed'.")

# standardizer.py: Transform-level dispatch
import pandas as pd

def standardize(records: list, source: str) -> pd.DataFrame:
    """Convert raw records from one source into a WoS-schema DataFrame."""
    standardized = []
    for record in records:
        if source == "pubmed":
            standardized.append(standardize_pubmed(record))
        elif source == "openalex":
            standardized.append(standardize_openalex(record))
        else:
            raise ValueError(f"Unsupported source: {source}. Choose 'pubmed' or 'openalex'.")
    df = pd.DataFrame(standardized)
    return SR(df)  # SR() computes the SR column on the assembled frame

fetch_openalex() and fetch_pubmed() each absorb their API's own pagination, ID resolution, and rate limiting, but both return the same shape (list[dict]). standardize_openalex() and standardize_pubmed() each own all null-handling, list-casting, and default-filling for their source. Source-specific logic is therefore confined to these two switch points instead of leaking into every caller.
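As an illustration of what "absorbing pagination and retries" can look like, here is a minimal sketch. It does not reproduce fetch_openalex() or fetch_pubmed(): `fetch_all`, `fetch_page`, and the `results` / `next_cursor` page shape are assumptions made for the example, and the page fetcher is injected so the sketch runs without network access.

```python
import time
from typing import Callable, Optional

def fetch_all(fetch_page: Callable[[Optional[str]], dict],
              total_wanted: int,
              max_retries: int = 3,
              base_delay: float = 0.0) -> list:
    """Collect up to total_wanted records, following cursors and retrying failures.

    Assumed page shape (hypothetical): {"results": [...], "next_cursor": str or None}
    """
    records: list = []
    cursor: Optional[str] = None
    while len(records) < total_wanted:
        for attempt in range(max_retries):
            try:
                page = fetch_page(cursor)
                break
            except ConnectionError:
                if attempt == max_retries - 1:
                    raise  # give up after the last attempt
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        records.extend(page["results"])
        cursor = page.get("next_cursor")
        if not cursor or not page["results"]:
            break  # no further pages
    return records[:total_wanted]


def make_fake_api():
    """Two-page fake API whose second page fails once before succeeding."""
    pages = {None: {"results": [{"id": 1}, {"id": 2}], "next_cursor": "c1"},
             "c1": {"results": [{"id": 3}], "next_cursor": None}}
    state = {"failed_once": False}

    def fetch_page(cursor):
        if cursor == "c1" and not state["failed_once"]:
            state["failed_once"] = True
            raise ConnectionError("transient failure")
        return pages[cursor]

    return fetch_page


records = fetch_all(make_fake_api(), total_wanted=10)
```

Whatever happens inside the loop, the caller only ever sees a list of dicts, which is the property the dispatcher relies on.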

1.3 Mapping dictionaries

mappings.py holds flat raw_field → WoS_tag dictionaries, applied via apply_mapping():

OPENALEX_MAPPING = {
    "id":               "UT",
    "doi":              "DI",
    "title":            "TI",
    "publication_year": "PY",
    "language":         "LA",
    "type":             "DT",
}

def apply_mapping(record: dict, mapping: dict) -> dict:
    """Renames raw API field names to WoS tags. Only processes fields in the mapping."""
    result = {}
    for raw_field, wos_tag in mapping.items():
        result[wos_tag] = record.get(raw_field, "")
    return result

Each dictionary covers only direct 1:1 renames and is followed by inline comments enumerating the two classes of exception that can't go in a lookup table. The first is fields needing structural transformation (OpenAlex's nested authorships, abstract_inverted_index, biblio; PubMed's articleids, pages), which are handled by dedicated parse_* functions. The second is fields with no equivalent on that source (ID for both, PMID for OpenAlex, TC/C1/DE/AU_CO for PubMed's eSummary response), which are explicitly filled with safe defaults rather than left silently absent. One parse_* function goes further than reshaping: OpenAlex's referenced_works is just a list of bare work IDs, so parse_openalex_references() hands those IDs to resolve_openalex_references(), which calls back out to the OpenAlex API and reconstructs proper "Smith J, 2019, NAT COMMUN"-style strings for CR. CR is thus the one field that round-trips through a second API call rather than transforming data already in hand.
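To make the "structural transformation" class concrete, here is a sketch of one such parser. OpenAlex serves abstracts as an inverted index (word → list of positions), so a rename alone cannot produce AB. The function name `parse_abstract_inverted_index` and its exact behaviour are assumptions for illustration, not the code in standardizer.py.

```python
from typing import Optional

def parse_abstract_inverted_index(inverted: Optional[dict]) -> str:
    """Rebuild plain abstract text from a word -> [positions] mapping."""
    if not inverted:
        return ""  # null-handling convention: "" for a missing scalar
    # Flatten to (position, word) pairs, then order by position
    positioned = [(pos, word)
                  for word, positions in inverted.items()
                  for pos in positions]
    return " ".join(word for _, word in sorted(positioned))


# Made-up sample in the inverted-index shape
sample = {"Deep": [0], "learning": [1, 4], "is": [2], "machine": [3]}
abstract = parse_abstract_inverted_index(sample)
```

A word that occurs several times simply contributes several positions, which is why the mapping cannot be expressed as a flat field rename.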

1.4 Type contracts

validator.py defines every WoS tag's required Python type and runs three checks against it:

MANDATORY_COLUMNS = {
    "DB": str, "UT": str, "DI": str, "PMID": str, "TI": str, "SO": str, "JI": str,
    "PY": str, "DT": str, "LA": str, "TC": int, "AU": list, "AF": list, "C1": list,
    "RP": str, "CR": list, "DE": list, "ID": list, "AB": str, "VL": str, "IS": str,
    "BP": str, "EP": str, "SR": str,
}

def validate(df: pd.DataFrame) -> pd.DataFrame:
    missing_cols = check_columns(df)   # presence
    null_cols    = check_nulls(df)     # no None / NaN anywhere
    type_cols    = check_types(df)     # isinstance per column
    if missing_cols or null_cols or type_cols:
        raise ValueError("Validation failed. Fix the issues above before proceeding.")
    return df

The null-handling convention ("" for scalars, [] for lists, never None/NaN) is enforced twice: inside standardize_openalex()/standardize_pubmed() via an explicit sanitization pass, and again externally by the validator as a hard gate. A regression in either layer is still caught by the other.
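The convention can be sketched on a single record, without pandas. The snippet is illustrative only: `CONTRACT` is a three-tag subset of MANDATORY_COLUMNS, and `sanitize_record` and `type_violations` are hypothetical helper names, not functions from standardizer.py or validator.py. It handles None only; a NaN check would need to be added separately.

```python
CONTRACT = {"TI": str, "TC": int, "AU": list}  # subset of MANDATORY_COLUMNS

def sanitize_record(record: dict, contract: dict = CONTRACT) -> dict:
    """Replace missing or None values with the empty value of the required type."""
    clean = {}
    for tag, required_type in contract.items():
        value = record.get(tag)
        # str() -> "", list() -> [], int() -> 0
        clean[tag] = required_type() if value is None else value
    return clean

def type_violations(record: dict, contract: dict = CONTRACT) -> list:
    """Return the tags whose value is not an instance of the required type."""
    return [tag for tag, required_type in contract.items()
            if not isinstance(record.get(tag), required_type)]


raw = {"TI": "A study", "TC": None}   # TC is null, AU is absent
clean = sanitize_record(raw)

violations_before = type_violations(raw)
violations_after = type_violations(clean)
```

Running the type check on both the raw and the sanitized record mirrors the two-layer idea: the second check would still flag the record if the sanitization step regressed.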

2. Files Changed

New files (www/services/)

| File | Purpose |
| --- | --- |
| `api_retriever.py` | Extract phase: OpenAlex & PubMed API retrieval, pagination, retries, rate-limit handling |
| `mappings.py` | `PUBMED_MAPPING` / `OPENALEX_MAPPING` declarative rename dictionaries |
| `standardizer.py` | Transform phase: dispatch, field parsing, type-casting, `SR` computation |
| `validator.py` | Validation phase: schema, null, and type contract checks |

3. Debugging & Patches to Analytical Functions

3.1 Methodology

Once the ETL produced valid, schema-compliant DataFrames, each was run through every relevant dashboard panel against a 200-row test set per source (test_openalex_200.csv, test_pubmed_200.csv). Every failure was classified before any fix was attempted:

Bug — the function crashes or silently produces wrong output even though the input satisfies the WoS contract.
Data-characteristic limitation — the function behaves correctly, but the source itself lacks the underlying information the analysis needs (e.g. PubMed's eSummary endpoint never returns abstracts or citation counts), so an empty or minimal result is the correct answer, not a bug.

Only items in the first bucket were patched. Items in the second are documented in 3.3 so graders and future maintainers don't mistake "no API support" for an ETL defect. Full technical detail (exact line, root cause, fix) for every file below lives in patching_documented.md.

3.2 Files patched — confirmed fixes

By far the most common single fix, applied across nearly every file below: replacing hasattr(df, "get") (which is also true for a plain pandas.DataFrame, since it has its own .get() method) with isinstance(df, pd.DataFrame) to correctly tell a Shiny reactive object apart from a plain DataFrame. The rest of each file's patches handle missing/empty columns, non-numeric PY/TC, and degenerate network/matrix cases gracefully instead of crashing.
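Why the hasattr check cannot distinguish the two objects is easiest to see with stand-ins. The classes below are hypothetical substitutes so the sketch runs without pandas or Shiny: `FrameLike` mimics the fact that a DataFrame has a `.get(key, default)` method, and `ReactiveLike` mimics a reactive value whose `.get()` returns the wrapped object. How the patched functions actually unwrap their input is an assumption here.

```python
class FrameLike:
    """Stand-in for pandas.DataFrame, which also exposes .get(key, default)."""
    def __init__(self, columns: dict):
        self.columns = columns

    def get(self, key, default=None):
        return self.columns.get(key, default)


class ReactiveLike:
    """Stand-in for a reactive value whose .get() returns the wrapped object."""
    def __init__(self, value):
        self._value = value

    def get(self):
        return self._value


def unwrap_by_hasattr(obj):
    # Duck-typing check: true for BOTH classes, so it cannot tell them apart
    return obj.get() if hasattr(obj, "get") else obj


def unwrap_by_isinstance(obj):
    # Type check: a plain frame passes through, anything else is unwrapped
    return obj if isinstance(obj, FrameLike) else obj.get()


frame = FrameLike({"PY": ["2012"]})
from_plain = unwrap_by_isinstance(frame)
from_reactive = unwrap_by_isinstance(ReactiveLike(frame))

try:
    unwrap_by_hasattr(frame)   # calls frame.get() with no key
    hasattr_variant_works = True
except TypeError:
    hasattr_variant_works = False
```

In this sketch the hasattr variant fails on a plain frame because it calls `.get()` without the key argument the frame's method requires.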

www/services/

biblionetwork.py, cocmatrix.py, couplingmap.py, format_functions.py, histnetwork.py, metatagextraction.py, networkplot.py, termextraction.py, thematicmap.py

www/functions/

get_affiliationproductionovertime.py, get_annualproduction.py, get_authorlocalimpact.py, get_authorproductionovertime.py, get_averagecitations.py, get_bradfordlaw.py, get_citedcountries.py, get_citeddocuments.py, get_clusteringcoupling.py, get_co_occurence_network.py, get_cocitation.py, get_collaborationnetwork.py, get_correspondingauthorcountries.py, get_countriesproduction.py, get_countriesproductionovertime.py, get_factorialanalysis.py, get_filters.py, get_frequentwords.py, get_historiograph.py, get_localcitedauthors.py, get_localciteddocuments.py, get_localcitedreferences.py, get_localcitedsources.py, get_lotkalaw.py, get_maininformations.py, get_referencesspectroscopy.py, get_relevantaffiliations.py, get_relevantauthors.py, get_relevantsources.py, get_sourceslocalimpact.py, get_sourcesproduction.py, get_table.py, get_thematicevolution.py, get_thematicmap.py, get_threefieldplot.py, get_treemap.py, get_trendtopics.py, get_wordcloud.py, get_wordfrequency.py, get_worldmapcollaboration.py

tabletag.py required no patch (passes as-is on both sources).

app.py

Co-citation panel callback misfired due to an always-truthy check on the result tuple — fixed the truthiness check.
Factorial Analysis incorrectly excluded the Title (TI) field via a SilentException — removed that exclusion.
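The co-citation fix is an instance of a general Python pitfall: a non-empty tuple is truthy regardless of what it contains. The sketch below uses a hypothetical `run_cocitation()` that returns a `(figure, table)` pair; the real callback's return shape is an assumption.

```python
def run_cocitation(has_data: bool) -> tuple:
    """Hypothetical analysis returning (figure, table); both None when empty."""
    return ("<figure>", "<table>") if has_data else (None, None)


result = run_cocitation(has_data=False)

# A tuple with two items is truthy even when both items are None
always_true = bool(result)

# Inspecting the elements is what actually detects an empty result
has_content = any(item is not None for item in result)
```

A guard written as `if result:` therefore always takes the "has data" branch, whereas the element-wise check does not.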

3.3 Data-characteristic limitations (not bugs)

| Field / Feature | Limitation | Affected source(s) |
| --- | --- | --- |
| `ID` (Keywords Plus) | WoS-exclusive field with no equivalent in either source's API; always empty | OpenAlex, PubMed |
| `WC` (Subject Categories) | No equivalent in either source; the column is never added to the standardized schema. `table_tag()` checks `if tag not in M.columns` before accessing it, so every consuming function degrades the same way `ID` does (a clean empty result, not a crash). Verified by direct testing (WordCloud, Most Frequent Words, TreeMap) on both sources | OpenAlex, PubMed |
| `AU_UN` (Author Institutions) | Collaboration Network returns nothing when Field is set to Institutions. The institution-extraction heuristic scans affiliation strings for WoS-convention tags (`UNIV`, `INST`, `COLL`); OpenAlex's `raw_affiliation_strings` don't follow that comma-segmented structure, so it largely fails to isolate clean institution names. This is the same underlying affiliation-format mismatch as the `AU_CO`/country case. For PubMed the empty result is expected for a different reason (`C1` is itself largely empty), but this hasn't been directly verified | OpenAlex, PubMed |
| `CR` (Cited References) | PubMed: recovered for only ~7.5% of records in the 200-row test sample (not zero, just sparse), because eSummary returns free-text reference strings for very few records. Impacts every panel that needs within-sample citation matching (Co-citation Network, Cluster by Coupling, Historiograph, Local Cited Authors/Documents/Sources/References, Three-Field Plot on Cited Sources) on PubMed data | PubMed |
| `AB` (Abstract) | PubMed's eSummary endpoint never returns abstracts, so any panel with Abstracts selected as the field has no vocabulary to work with | PubMed |
| `TC` (Times Cited) | PubMed's eSummary endpoint never returns citation counts, so `TC` stays `0` for every record (filled as a safe default) | PubMed |
| `PY` distribution | In the 200-row PubMed test sample, 139/200 rows fall in a single year (2024) across only 4 distinct years total, too narrow a spread for Thematic Evolution's time-binning to form usable periods. A property of the test sample, not of the field itself | PubMed |
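The format mismatch behind the `AU_UN` row can be illustrated with a simplified heuristic of the kind described. This is a sketch under assumptions, not the dashboard's extraction code: `extract_institution` is a hypothetical name, exact token matching is an assumed matching rule, and both affiliation strings are made up for the example.

```python
from typing import Optional

INSTITUTION_TAGS = ("UNIV", "INST", "COLL")

def extract_institution(affiliation: str) -> Optional[str]:
    """Return the first comma-separated segment containing a WoS-style tag."""
    for segment in affiliation.split(","):
        words = segment.upper().split()
        # Assumed rule: the tag must appear as a whole token in the segment
        if any(word in INSTITUTION_TAGS for word in words):
            return segment.strip()
    return None


# Made-up examples: an abbreviated WoS-style address and a free-text one
wos_style = "UNIV NAPLES FEDERICO II, DEPT ELECT ENGN, NAPLES, ITALY"
free_text = "Department of Electrical Engineering, University of Naples Federico II, Italy"

wos_hit = extract_institution(wos_style)
free_hit = extract_institution(free_text)
```

Under this assumed rule the abbreviated address yields an institution while the free-text one yields nothing, since "UNIVERSITY" is not the token "UNIV".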

4. Validation Summary

The patches in 3.2 bring the entire set of assigned services/ and functions/ modules to crash-free, graceful execution on the 200-row test sets for both OpenAlex and PubMed (full per-function status in patching_documented.md, modulo the four entries flagged in 3.3 as superseded). The remaining gaps are the data-characteristic limitations in 3.3 — empty or reduced output where the source itself doesn't provide the underlying information, not pipeline defects.

5. Live Dashboard Integration (Bonus)

Beyond the standalone CSV deliverable, the API query feature is wired directly into the Shiny dashboard, satisfying the spec's optional bonus ("Integrate this API query feature directly into the Python Shiny interface").

Two changes were required:

www/services/__init__.py now exports the three ETL entry points:
```
from .api_retriever import retrieve
from .standardizer import standardize
from .validator import validate
```
Without this, app.py cannot see these functions via from www.services import * and fails with name 'retrieve' is not defined.
app.py replaces the previous placeholder panel ("🚧 API is under construction 🚧") with a working "API Query" panel: a text input for the search query, a platform selector (OpenAlex / PubMed), a record-count input, and a "Run Query" button. On click:
```
records = retrieve(query=query, platform=platform, total=total)
df_api = standardize(records, source=platform)
df_api = validate(df_api)

api_result.set(df_api)
df.set(df_api)  # makes the data available to the entire dashboard
```
The panel also renders the first rows of the normalized result (AU, TI, PY, SO, TC, DB, SR) directly in the UI, satisfying the "print the first few normalized rows" requirement without a separate script.

This path never touches a CSV file — data goes straight from validate() into the shared reactive df. The CSV-serialization conventions discussed elsewhere (e.g. the ; delimiter for multi-value fields) apply only to the standalone CSV deliverable and to the dashboard's separate "Import Raw Data" upload panel, not to this live-query path.

6. Execution Evidence

execution_evidence.ipynb is the step-by-step execution log required by the spec, run end to end against both APIs:

Extract → Transform → Validate on 10 sample records per source, with df.head() previews printed at each step, to inspect the raw API response shape before standardization.
Full run on 200 records per source, saved to test_openalex_200.csv / test_pubmed_200.csv with the ; delimiter convention described in Section 1.3.
Phase 4 — CSV round-trip integrity check: both CSVs are reloaded from disk and re-validated, confirming the type contract (lists stay lists, no blank cell becomes NaN) survives a full save/reload cycle — relevant for the dashboard's separate "Import Raw Data" upload panel, as opposed to the live in-memory path described in Section 5.
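The round-trip check can be sketched with the standard library alone. The snippet assumes list-typed fields are joined with `;` on write and split on read, in line with the delimiter convention mentioned above. The helper names, the in-memory buffer, and the two-column schema are illustrative, and the notebook's own implementation may differ.

```python
import csv
import io

LIST_COLUMNS = {"AU"}  # subset of the list-typed tags in MANDATORY_COLUMNS

def write_rows(rows: list, buffer: io.StringIO) -> None:
    """Serialize rows to CSV, joining list fields with ';'."""
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    for row in rows:
        writer.writerow({key: ";".join(value) if key in LIST_COLUMNS else value
                         for key, value in row.items()})

def read_rows(buffer: io.StringIO) -> list:
    """Parse CSV back, restoring lists; a blank cell becomes [] or ""."""
    restored = []
    for row in csv.DictReader(buffer):
        restored.append({
            key: ([part for part in value.split(";") if part]
                  if key in LIST_COLUMNS else value)
            for key, value in row.items()
        })
    return restored


rows = [{"TI": "A study", "AU": ["Smith J", "Rossi M"]},
        {"TI": "", "AU": []}]          # second row exercises the blank-cell case

buffer = io.StringIO(newline="")
write_rows(rows, buffer)
buffer.seek(0)
round_tripped = read_rows(buffer)
```

The second row is the interesting one: both cells are blank on disk, yet they come back as `""` and `[]` rather than as a missing value.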

Sample output from the notebook:

records = retrieve(query="machine learning", platform="openalex", total=200)
df = standardize(records, source="openalex")
df = validate(df)

Running validation...
  PASS — all mandatory columns present
  PASS — no null values found
  PASS — all column types correct
Validation passed.

SR sample: Pedregosa F, 2012, ARXIV (CORNELL UNIVERSITY)

The same flow with platform="pubmed", source="pubmed" produces test_pubmed_200.csv. Both CSVs can be uploaded directly to the dashboard after converting them to XLSX format.
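For readers unfamiliar with the SR tag, the sample above follows a "first author, year, source" pattern. The sketch below reproduces that pattern for one record. It is not the SR() function from standardizer.py: `short_reference` is a hypothetical name, using SO as the source field is an assumption, and the record's second author and mixed-case source string are filled in for illustration.

```python
def short_reference(record: dict) -> str:
    """Build 'First Author, Year, SOURCE' from the AU, PY and SO fields."""
    first_author = record["AU"][0] if record["AU"] else ""
    return f'{first_author}, {record["PY"]}, {record["SO"].upper()}'


# Illustrative record shaped like the standardized schema
record = {"AU": ["Pedregosa F", "Varoquaux G"],
          "PY": "2012",
          "SO": "arXiv (Cornell University)"}

sr = short_reference(record)
```

Because PY is already a string under the type contract, no cast is needed when assembling the reference.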


Solved bugs:
- `LA` (PubMed) returned as a list `['eng']` by the API, needed to extract first element as string
- `DI` (OpenAlex) returned as full URL `"https://doi.org/..."`, needed to strip the prefix. Also `None` for some records, needed `or ""` guard
- `PY` (OpenAlex) returned as integer `2012`, needed to cast to string `"2012"`


isx9 and others added 30 commits May 20, 2026 20:13

Create audit.md

78a1a22

Update audit services.md

2f7a6bb

Create standardizer.py

67a44ff

Brief description inside.

Create api_retriever.py

d2317b6

Description inside.

Create validator.py

e76adb1

Description inside.

Update audit.md

028d4c8

Finished www/services/

Update audit.md

0a22075

Update audit.md

99e1a43

Update audit.md

a3ddabf

Update audit.md

0122bb4

Update audit.md

9615264

Add files via upload

0552c24

Update audit.md

5a06262

Update audit.md

01dffaf

Update audit.md

2c497e9

Update audit.md

1d65f1a

Update audit.md

dc2bfc4

Update audit.md

920d19e

Update audit.md

ccdcfca

Update audit.md

df564b9

Update audit.md

da6a2da

mappings.py

8a2279f

Upload standardizer.py

53f92e5

Completed for both OpenAlex and PubMed.

SR calculation

40cd04f

Added SR calculation to standardize() main function

Upload validator.py

65465ea

Add validator.py with column, null and type checks

New upload standardizer.py

11d650d

Copy SR function directly into standardizer to avoid relative import error

Update audit.md

86d1fed

Update audit.md

c67866f

Update audit.md

061f303

isx9 and others added 30 commits June 14, 2026 13:20

Update patching_documented.md

f81cdd4

Update patching_documented.md

801bce7

Update patching_documented.md

4b31207

OpenAlex fixes and debugging

f17361c

Merge origin/dev

a635fbb

Finished patching_documented.md

6384743

Finish merge with origin/dev

be82fee

Ignore virtual environment

47eb9d0

WIP local cited authors debugging

73a4f05

WIP fix local cited authors

4c2272e

Merge branch 'dev' of https://github.com/isx9/bibliometrix-python int…

0c2390c

…o dev

Patch ETL pipeline and analytical functions for OpenAlex/PubMed compa…

5106aa0

…tibility

Remove .venv from version control

1cdcabe

fix: restore corrected ETL files and standardizer improvements

8205ed2

Revert "fix: restore corrected ETL files and standardizer improvements"

7a2900e

This reverts commit 8205ed2.

fix for empty chart in Local Cited Authors

57bd350

now these work

3ff644b

to fix cocitation

6a733b6

Add PubMed date filtering and fix Thematic Evolution crashes

5ee61b0

patching

480b067

Update patching_documented.md

9fb303b

Update patching_documented.md

dc0219a

Create limitations.md

0524cc7

update notebook & files

e03533d

Update limitations.md

dcc72fa

Rename demo_etl.ipynb to execution_evidence.ipynb

6e1b41a

Add files via upload

e4b1aa7

Delete patching_documented.md

86a2377

Rename patching_documented (2).md to patching_documented.md

bf36b43

isx9 wants to merge 112 commits into PRAISELab-PicusLab:main from isx9:dev
