feat: Add ETL pipeline for OpenAlex/PubMed with dashboard fixes #23

Open
solmaznazari wants to merge 1 commit into PRAISELab-PicusLab:main from solmaznazari:main

@solmaznazari solmaznazari commented Jun 17, 2026

Project Overview

This PR implements a robust, source-agnostic ETL pipeline for bibliometrix-python, enabling the Shiny dashboard to work with OpenAlex and PubMed data in addition to Web of Science.


Architecture

Dispatcher Pattern

The pipeline exposes a single entry point, transform(), that acts as a dispatcher. Regardless of input source (OpenAlex, PubMed, Scopus, WoS), all data passes through the same transformation chain:

Raw Data → extract() → transform() → validate() → Standardized DataFrame

Mapping Dictionary (Type Contracts)

Defined in schemas.py:

  • MULTI_VALUE_FIELDS = ["AU", "AF", "C1", "CR", "DE", "ID"] → must be list[str]
  • STRING_FIELDS = ["TI", "SO", "AB", "SR", ...] → must be str, nulls → ""
  • INT_FIELDS = ["TC"] → must be int, nulls → 0
  • YEAR_FIELDS = ["PY"] → must be 4-digit int
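
The contracts above can be sketched on a single record. This is a minimal, dependency-free illustration that uses a plain dict in place of a DataFrame row, not the code in schemas.py or transformers.py. The function name `enforce_record_types`, the `;` separator for multi-value strings, and the choice to raise on an invalid year are assumptions; STRING_FIELDS is abbreviated in the description, so only the four names shown there are used.

```python
import math

# Field groups as listed in the PR description (STRING_FIELDS is abbreviated there).
MULTI_VALUE_FIELDS = ["AU", "AF", "C1", "CR", "DE", "ID"]
STRING_FIELDS = ["TI", "SO", "AB", "SR"]
INT_FIELDS = ["TC"]
YEAR_FIELDS = ["PY"]


def is_null(value):
    """True for None and float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def enforce_record_types(record, sep=";"):
    """Return a copy of `record` coerced to the contracts above.

    Assumption: multi-value fields that arrive as strings are `sep`-delimited.
    """
    out = dict(record)
    for field in MULTI_VALUE_FIELDS:
        value = out.get(field)
        if is_null(value):
            out[field] = []
        elif isinstance(value, str):
            out[field] = [part.strip() for part in value.split(sep) if part.strip()]
        else:
            out[field] = [str(item) for item in value]
    for field in STRING_FIELDS:
        value = out.get(field)
        out[field] = "" if is_null(value) else str(value)
    for field in INT_FIELDS:
        value = out.get(field)
        out[field] = 0 if is_null(value) else int(value)
    for field in YEAR_FIELDS:
        value = out.get(field)
        # Assumption: the description gives no default for a missing year,
        # so this sketch rejects anything that is not a 4-digit integer.
        if is_null(value) or not 1000 <= int(value) <= 9999:
            raise ValueError(f"{field} must be a 4-digit year, got {value!r}")
        out[field] = int(value)
    return out
```

For example, `enforce_record_types({"AU": "Doe J; Roe K", "TC": None, "PY": 2020.0})` yields a list for AU, 0 for TC, and the integer 2020 for PY.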

New Files

www/services/etl/schemas.py

Defines the standard WoS schema: all required columns, their Python types, and null-handling rules.

www/services/etl/extractors.py

  • fetch_openalex(query, max_results) — queries OpenAlex REST API with pagination
  • fetch_pubmed(query, max_results) — queries PubMed via Biopython Entrez
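
The pagination in fetch_openalex could look roughly like the loop below. It is a sketch under assumptions: the PR does not say whether it uses cursor or page-number paging, so cursor paging (OpenAlex's `cursor=*` and `meta.next_cursor`) is assumed here, and the names `collect_works` and `make_openalex_page_getter` are hypothetical. The page fetcher is passed in as a callable so the loop can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request


def collect_works(get_page, max_results, per_page=25):
    """Collect up to `max_results` records by following cursors.

    `get_page(cursor, per_page)` must return a dict shaped like an OpenAlex
    response: {"results": [...], "meta": {"next_cursor": ...}}.
    """
    works = []
    cursor = "*"  # OpenAlex starts cursor paging with "*"
    while cursor and len(works) < max_results:
        page = get_page(cursor, per_page)
        results = page.get("results", [])
        if not results:
            break
        works.extend(results)
        cursor = page.get("meta", {}).get("next_cursor")
    return works[:max_results]


def make_openalex_page_getter(query):
    """Build a `get_page` callable that queries the OpenAlex works endpoint."""
    def get_page(cursor, per_page):
        params = urllib.parse.urlencode(
            {"search": query, "per-page": per_page, "cursor": cursor}
        )
        url = f"https://api.openalex.org/works?{params}"
        with urllib.request.urlopen(url) as response:
            return json.load(response)
    return get_page
```

A live call would then be `collect_works(make_openalex_page_getter("machine learning"), max_results=5)`; in tests, a fake `get_page` that returns canned pages can be substituted.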

www/services/etl/transformers.py

  • enforce_types(df) — enforces type contracts from schemas.py
  • ensure_columns(df) — adds missing required columns with empty defaults
  • add_sr_field(df) — generates SR key: "Surname Year JournalAbbr VVolume"
  • transform(df) — master dispatcher combining all steps
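
How the steps might chain together can be illustrated on a single record. This is only a sketch: the step order, the `REQUIRED_DEFAULTS` subset, the "Surname Initials" author format, and the handling of a missing volume are all assumptions, and plain dicts stand in for DataFrame rows. The `UNKNOWN` and `UNKNOWNJ` placeholders mirror the ones visible in the sample output in this description.

```python
import copy
from functools import reduce

# Hypothetical subset of required columns with empty defaults.
REQUIRED_DEFAULTS = {"AU": [], "PY": 0, "JI": "", "VL": "", "SR": ""}


def ensure_columns(record):
    """Add any missing required column with its empty default."""
    out = dict(record)
    for column, default in REQUIRED_DEFAULTS.items():
        out.setdefault(column, copy.copy(default))
    return out


def add_sr_field(record):
    """Build an SR key shaped like "Surname Year JournalAbbr V<volume>"."""
    out = dict(record)
    # Assumption: AU entries look like "Surname Initials",
    # so the surname is the first whitespace-separated token.
    tokens = out["AU"][0].split() if out["AU"] else []
    surname = tokens[0] if tokens else "UNKNOWN"
    journal = out["JI"] or "UNKNOWNJ"
    volume = out["VL"] or "0"  # assumption: a missing volume is rendered as 0
    out["SR"] = f"{surname} {out['PY']} {journal} V{volume}"
    return out


def transform(record, steps=(ensure_columns, add_sr_field)):
    """Run the record through each step in order."""
    return reduce(lambda current, step: step(current), steps, record)
```

Keeping the steps as a tuple of functions means a source-specific step can be added or reordered without touching the dispatcher itself.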

www/services/etl/validators.py

  • validate(df) — checks all required columns exist, correct types, no NaN/None
  • print_report(result) — prints human-readable validation report
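
The three checks that validate() is described as performing (required columns present, correct types, no NaN/None) can be sketched as follows. The `EXPECTED_TYPES` subset and the shape of the result dict are assumptions for illustration; the real validators.py works on a DataFrame and may report differently.

```python
import math

# Hypothetical subset of the schema: column name -> expected Python type.
EXPECTED_TYPES = {"TI": str, "AU": list, "TC": int, "PY": int}


def validate(records, expected=EXPECTED_TYPES):
    """Check each record for missing columns, nulls, and wrong types."""
    errors = []
    for index, record in enumerate(records):
        for column, expected_type in expected.items():
            if column not in record:
                errors.append(f"row {index}: missing column {column}")
                continue
            value = record[column]
            if value is None or (isinstance(value, float) and math.isnan(value)):
                errors.append(f"row {index}: {column} is null")
            elif not isinstance(value, expected_type):
                errors.append(
                    f"row {index}: {column} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return {"valid": not errors, "n_records": len(records), "errors": errors}


def print_report(result):
    """Print a short human-readable summary of a validate() result."""
    status = "PASS" if result["valid"] else "FAIL"
    print(f"Validation {status}: {result['n_records']} records, "
          f"{len(result['errors'])} errors")
    for line in result["errors"]:
        print(f"  - {line}")
```

Returning a result dict rather than raising lets the caller decide whether a failed validation should stop the pipeline or only be reported.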

www/services/etl/loader.py

  • load_standardized_csv(path) — reloads CSV with correct list parsing
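
Reloading needs care because list-valued columns come back from CSV as strings. The sketch below assumes the lists were written as Python list literals (for example `"['Doe J', 'Roe K']"`), which is what a DataFrame of lists typically produces on export; the PR does not state the serialization, and the helper names `parse_list_cell` and `load_rows` are hypothetical. It uses the standard csv module rather than pandas.

```python
import ast
import csv
import io

MULTI_VALUE_FIELDS = ["AU", "AF", "C1", "CR", "DE", "ID"]


def parse_list_cell(cell):
    """Turn a serialized list such as "['a', 'b']" back into a list of strings."""
    if not cell:
        return []
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return [cell]  # not a literal: keep the raw text as a single item
    if isinstance(parsed, list):
        return [str(item) for item in parsed]
    return [str(parsed)]


def load_rows(handle):
    """Read CSV rows and restore list-typed columns."""
    rows = []
    for row in csv.DictReader(handle):
        for field in MULTI_VALUE_FIELDS:
            if field in row:
                row[field] = parse_list_cell(row[field])
        rows.append(row)
    return rows


# Usage with an in-memory file; a real call would pass open(path, newline="").
sample = io.StringIO("TI,AU\nA title,\"['Doe J', 'Roe K']\"\n")
first_row = load_rows(sample)[0]
```

`ast.literal_eval` is used instead of `eval` so that a malformed or hostile cell cannot execute code.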

Execution Evidence

Live API Query (OpenAlex)

=== Live API Query from OpenAlex ===
Query: machine learning | Max results: 5
Records fetched: 5

=== Normalized Output (key fields) ===
         DB                                          TI    PY     TC
0  OPENALEX  Scikit-learn: Machine Learning in Python  2012  63729
1  OPENALEX  Genetic algorithms in search...           1989  49334
2  OPENALEX  C4.5: Programs for Machine Learning       1992  23698
3  OPENALEX  UCI Machine Learning Repository           2007  24350
4  OPENALEX  Data Mining: Practical ML Tools           2011  25713

Standardized CSV Output

=== ETL Pipeline Output ===
Total records: 50
Columns: ['DB', 'UT', 'DI', 'PMID', 'TI', 'SO', 'JI', 'J9', 'PY', 'DT', 
          'LA', 'TC', 'AU', 'AF', 'C1', 'RP', 'CR', 'DE', 'ID', 'AB', 
          'VL', 'IS', 'BP', 'EP', 'SR', 'AU_UN', 'AU1_CO', 'C3']

First 3 rows:
DB        TI                                      PY    TC      SR
OPENALEX  Scikit-learn: Machine Learning...       2012  63727   Pedregosa 2012 arXiv VV0
OPENALEX  Genetic algorithms in search...         1989  49334   UNKNOWN 1989 Choice V27
OPENALEX  C4.5: Programs for Machine Learning     1992  23698   Quinlan 1992 UNKNOWNJ VV0

Dashboard Patches

1. OpenAlex Source Support

Added elif source == "openalex" in process_single_file() and pass-through handlers in all format_XX_column() functions in format_functions.py.

2. Reactive Value Fixes (app.py)

Fixed 30+ instances of df → df.get(), where Shiny reactive Values were passed directly to analytical functions, causing 'Value' object has no attribute 'columns' errors.
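
The failure mode can be reproduced without Shiny. In the sketch below, `Value` and `Frame` are minimal stand-ins written for this illustration (they are not Shiny's or pandas' classes), and `count_columns` is a hypothetical analytical function; only the wrap-then-unwrap behaviour is modelled.

```python
class Value:
    """Stand-in for a reactive value container; only get() and set() are modelled."""

    def __init__(self, value=None):
        self._value = value

    def get(self):
        return self._value

    def set(self, value):
        self._value = value


class Frame:
    """Stand-in for a DataFrame exposing a .columns attribute."""

    def __init__(self, columns):
        self.columns = list(columns)


def count_columns(frame):
    """Hypothetical analytical function that expects a DataFrame-like object."""
    return len(frame.columns)


df = Value(Frame(["TI", "PY", "TC"]))

try:
    count_columns(df)  # bug: passes the container, not its contents
except AttributeError as exc:
    error_message = str(exc)

n_columns = count_columns(df.get())  # fix: unwrap before calling
```

The container has no `columns` attribute of its own, so the call fails until the wrapped object is retrieved with `.get()`.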

3. network_plot() Rewrite (networkplot.py)

The original networkPlot() accepted only a weights parameter. The network_plot() wrapper was rewritten to accept NetMatrix (an adjacency DataFrame), build the igraph network, compute the layout, run walktrap community detection, and return a result dict with keys: graph, layout, cluster_obj, cluster_res, S, color.

4. Layout Attribute Fix

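Fixed cocnet['graph']['layout'] → cocnet['layout'] (the layout is a top-level key of the result dict; indexing the igraph Graph with ['layout'] looks up a graph attribute, which is not where the layout is stored) in: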

  • get_co_occurence_network.py
  • get_cocitation.py
  • get_collaborationnetwork.py

5. Type Enforcement

  • NaN → integer errors: added pd.isna() guard in get_relevantauthors.py, get_citedcountries.py
  • .str accessor on float columns: added .astype(str) in thematicmap.py
  • TC/PY numeric coercion before analytical operations in get_citeddocuments.py, get_historiograph.py, couplingmap.py

6. Empty Matrix Handling

Functions now return None instead of crashing when the network matrix is empty:

  • biblionetwork.py — empty collaboration/co-citation matrices
  • couplingmap.py — empty coupling networks
  • thematicmap.py — insufficient data warning
  • get_worldmapcollaboration.py — missing country affiliations
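
The guard pattern amounts to checking the matrix before any graph construction. A minimal sketch follows, assuming "empty" means either no rows or no non-zero weights, and using a nested list in place of the adjacency DataFrame; `build_network` is a hypothetical name, not a function from the patched files.

```python
def build_network(matrix):
    """Return a small network description, or None when there is nothing to plot.

    Assumption: `matrix` is a square, symmetric adjacency matrix given as a
    list of rows, and a matrix with no non-zero weight counts as empty.
    """
    if not matrix or not any(any(row) for row in matrix):
        return None
    # Keep each undirected edge once by reading the upper triangle only.
    edges = [
        (i, j, weight)
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if j > i and weight
    ]
    return {"n_nodes": len(matrix), "edges": edges}


result = build_network([[0, 0], [0, 0]])
message = "No network to display" if result is None else f"{len(result['edges'])} edges"
```

Callers then have to handle the None case explicitly, for example by showing a notice in place of the plot.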

7. CSS Fix

Fixed style="width=100%" → style="width:100%" in app.py (caused ITable rendering crash).


Dashboard Validation Results

Tested with OpenAlex data (50 documents, query: "machine learning"):

Dashboard Section Status
Data Upload (OpenAlex CSV)
Completeness Check
Main Information (Box + Table)
Annual Scientific Production
Average Citations per Year
Three-Field Plot
Most Relevant Sources
Bradford's Law
Sources Local Impact
Authors Production over Time
Lotka's Law
Most Relevant Affiliations
Most Global Cited Documents
Most Frequent Words
WordCloud
Co-occurrence Network
Thematic Map
Collaboration Network
Co-citation Network

Known Limitations (Data-dependent)

  • Historiograph / Local Citations: OpenAlex returns work IDs in referenced_works, not formatted citation strings. Functions requiring standard CR format produce empty results but do not crash.
  • Countries Collaboration: Requires complete affiliation data; OpenAlex coverage is partial.

[Four dashboard screenshots attached, dated 2026-06-17]

Contributors

  • Solmaz Nazari
  • Mostafa Rezvanifar
  • Arman Zanganeh

- Add ETL pipeline (extractors, transformers, validators, loader, schemas)
- Add OpenAlex and PubMed API extractors
- Add standardized CSV output compatible with bibliometrix-python dashboard
- Patch format_functions.py to support OpenAlex source
- Fix df.get() reactive Value issues throughout app.py
- Fix network_plot() to support NetMatrix parameter
- Fix layout attribute access in co-occurrence, collaboration, cocitation networks
- Fix NaN handling in citation counts and year fields
- Fix str accessor errors in thematicmap.py
- Add error handling for empty matrices in biblionetwork.py