feat: Add ETL pipeline for OpenAlex/PubMed with dashboard fixes#23
Open
solmaznazari wants to merge 1 commit into
- Add ETL pipeline (extractors, transformers, validators, loader, schemas)
- Add OpenAlex and PubMed API extractors
- Add standardized CSV output compatible with bibliometrix-python dashboard
- Patch format_functions.py to support OpenAlex source
- Fix df.get() reactive Value issues throughout app.py
- Fix network_plot() to support NetMatrix parameter
- Fix layout attribute access in co-occurrence, collaboration, cocitation networks
- Fix NaN handling in citation counts and year fields
- Fix str accessor errors in thematicmap.py
- Add error handling for empty matrices in biblionetwork.py
Project Overview
This PR implements a robust, source-agnostic ETL pipeline for bibliometrix-python, enabling the Shiny dashboard to work with OpenAlex and PubMed data in addition to Web of Science.
Architecture
Dispatcher Pattern
The pipeline uses a single entry-point function, `transform()`, that acts as a dispatcher: regardless of input source (OpenAlex, PubMed, Scopus, WoS), all data passes through the same transformation chain.
Mapping Dictionary (Type Contracts)
Defined in `schemas.py`:
- `MULTI_VALUE_FIELDS = ["AU", "AF", "C1", "CR", "DE", "ID"]` → must be `list[str]`
- `STRING_FIELDS = ["TI", "SO", "AB", "SR", ...]` → must be `str`, nulls → `""`
- `INT_FIELDS = ["TC"]` → must be `int`, nulls → `0`
- `YEAR_FIELDS = ["PY"]` → must be a 4-digit `int`
New Files
`www/services/etl/schemas.py`
Defines the standard WoS schema: all required columns, their Python types, and null-handling rules.
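As a minimal sketch of how the type contracts listed under Mapping Dictionary might be declared and applied, the following works on plain dict records rather than a DataFrame so that it runs without third-party packages. The field lists mirror the ones quoted in this description (with `STRING_FIELDS` limited to the four names actually shown). The `;` separator, the helper names, and the choice to return `None` for a non-4-digit year are assumptions for illustration and are not taken from `schemas.py`.

```python
import math

# Field groups as quoted in the PR description. STRING_FIELDS is elided there,
# so only the four names that are shown appear here.
MULTI_VALUE_FIELDS = ["AU", "AF", "C1", "CR", "DE", "ID"]
STRING_FIELDS = ["TI", "SO", "AB", "SR"]
INT_FIELDS = ["TC"]
YEAR_FIELDS = ["PY"]


def is_null(value):
    """True for None and float NaN, the two null forms handled in this sketch."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def to_list(value, sep=";"):
    """Coerce to list[str]; the ';' separator is an assumption."""
    if is_null(value):
        return []
    if isinstance(value, (list, tuple)):
        return [str(v).strip() for v in value if not is_null(v)]
    return [part.strip() for part in str(value).split(sep) if part.strip()]


def to_int(value):
    """Coerce to int, mapping nulls and unparseable values to 0."""
    if is_null(value):
        return 0
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return 0


def to_year(value):
    """Return a 4-digit int year, or None when the value is not one (assumed rule)."""
    year = to_int(value)
    return year if 1000 <= year <= 9999 else None


def enforce_record(record):
    """Hypothetical per-record analogue of enforce_types(df).

    Missing keys are treated as nulls, so every contract field is present
    in the returned dict.
    """
    out = dict(record)
    for field in MULTI_VALUE_FIELDS:
        out[field] = to_list(record.get(field))
    for field in STRING_FIELDS:
        value = record.get(field)
        out[field] = "" if is_null(value) else str(value)
    for field in INT_FIELDS:
        out[field] = to_int(record.get(field))
    for field in YEAR_FIELDS:
        out[field] = to_year(record.get(field))
    return out


raw = {"AU": "SMITH J; DOE A", "TI": None, "TC": float("nan"), "PY": "2021.0"}
clean = enforce_record(raw)
```

Keeping the field groups as module-level lists means a single loop per group can apply the same coercion to every column of that kind, which is presumably what lets one `transform()` entry point serve several sources.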
`www/services/etl/extractors.py`
- `fetch_openalex(query, max_results)`: queries the OpenAlex REST API with pagination
- `fetch_pubmed(query, max_results)`: queries PubMed via Biopython Entrez
`www/services/etl/transformers.py`
- `enforce_types(df)`: enforces the type contracts from `schemas.py`
- `ensure_columns(df)`: adds missing required columns with empty defaults
- `add_sr_field(df)`: generates the SR key: `"Surname Year JournalAbbr VVolume"`
- `transform(df)`: master dispatcher combining all steps
`www/services/etl/validators.py`
- `validate(df)`: checks that all required columns exist, have the correct types, and contain no NaN/None
- `print_report(result)`: prints a human-readable validation report
`www/services/etl/loader.py`
- `load_standardized_csv(path)`: reloads the CSV with correct list parsing
Execution Evidence
Live API Query (OpenAlex)
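For orientation, a query of this kind can be sketched with cursor pagination against the public OpenAlex `/works` endpoint. This is a hedged illustration, not the body of `fetch_openalex()`: the function name `fetch_openalex_sketch`, the injectable `get_json` parameter, and the use of cursor paging rather than page numbers are assumptions. The stub records at the end are made up so the example runs offline.

```python
import json
import urllib.parse
import urllib.request

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def http_get_json(url):
    """Default fetcher: GET the URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def fetch_openalex_sketch(query, max_results=50, per_page=25, get_json=http_get_json):
    """Collect up to max_results work records using cursor pagination."""
    results = []
    cursor = "*"  # OpenAlex starts cursor paging with cursor=*
    while cursor and len(results) < max_results:
        params = urllib.parse.urlencode(
            {"search": query, "per-page": per_page, "cursor": cursor}
        )
        payload = get_json(f"{OPENALEX_WORKS_URL}?{params}")
        page = payload.get("results", [])
        if not page:
            break
        results.extend(page)
        # meta.next_cursor is null once the last page has been returned
        cursor = payload.get("meta", {}).get("next_cursor")
    return results[:max_results]


# Offline demonstration with a stub fetcher and made-up records.
_PAGES = {
    "*": {"meta": {"next_cursor": "abc"}, "results": [{"id": "W1"}, {"id": "W2"}]},
    "abc": {"meta": {"next_cursor": None}, "results": [{"id": "W3"}]},
}


def stub_get_json(url):
    """Return the canned page that matches the cursor in the request URL."""
    query = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)
    return _PAGES[query["cursor"][0]]


works = fetch_openalex_sketch("machine learning", get_json=stub_get_json)
```

Passing the fetcher in as a parameter keeps the paging logic testable without network access; calling the function without `get_json` would issue real HTTP requests.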
Standardized CSV Output
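A small sketch of the round trip implied by `load_standardized_csv(path)`: list-valued fields are flattened to delimited strings on write and split back into lists on read. The `;` delimiter, the four-column subset, and the helper names `dump_records` and `load_records` are assumptions; this description does not state how lists are serialized.

```python
import csv
import io

LIST_FIELDS = ["AU", "DE"]  # subset of the multi-value fields, for brevity
SEP = ";"  # assumed delimiter for list values


def dump_records(records, fieldnames):
    """Write records to CSV text, joining list fields with SEP."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for record in records:
        row = dict(record)
        for field in LIST_FIELDS:
            row[field] = SEP.join(record.get(field, []))
        writer.writerow(row)
    return buffer.getvalue()


def load_records(text):
    """Read CSV text back, splitting list fields and restoring TC as int."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in LIST_FIELDS:
            row[field] = [v for v in row[field].split(SEP) if v]
        row["TC"] = int(row["TC"] or 0)
        records.append(dict(row))
    return records


original = [
    {"AU": ["SMITH J", "DOE A"], "DE": [], "TI": "A title, with comma", "TC": 3}
]
restored = load_records(dump_records(original, ["AU", "DE", "TI", "TC"]))
```

Without the explicit split on read, `AU` would come back as the single string `"SMITH J;DOE A"`, which is the kind of type drift a dedicated loader avoids.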
Dashboard Patches
1. OpenAlex Source Support
Added `elif source == "openalex"` in `process_single_file()` and pass-through handlers in all `format_XX_column()` functions in `format_functions.py`.
2. Reactive Value Fixes (`app.py`)
Fixed 30+ instances of `df` → `df.get()` where Shiny reactive Values were passed directly to analytical functions, causing `'Value' object has no attribute 'columns'` errors.
3. `network_plot()` Rewrite (`networkplot.py`)
The original `networkPlot()` only accepted a `weights` parameter. Rewrote the `network_plot()` wrapper to accept `NetMatrix` (an adjacency DataFrame), build the igraph network, compute the layout, run community detection (walktrap), and return a complete result dict with keys `graph`, `layout`, `cluster_obj`, `cluster_res`, `S`, `color`.
4. Layout Attribute Fix
Fixed `cocnet['graph']['layout']` → `cocnet['layout']` (the layout is a top-level key of the result dict; it is not stored as an attribute on the igraph Graph object) in:
- `get_co_occurence_network.py`
- `get_cocitation.py`
- `get_collaborationnetwork.py`
5. Type Enforcement
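The two guards this item covers can be sketched without pandas. The example below is a dependency-free analogue that uses `math.isnan` where the patch uses `pd.isna()` and `str()` where it uses `.astype(str)`; the helper names `safe_int` and `safe_upper` are hypothetical.

```python
import math


def safe_int(value, default=0):
    """Convert a citation count or year to int, tolerating None and NaN."""
    if value is None:
        return default
    if isinstance(value, float) and math.isnan(value):
        return default
    return int(value)


def safe_upper(value):
    """String operation that also accepts floats, such as NaN in a text column.

    Unlike a bare str(), this maps NaN to an empty string instead of "nan".
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value).upper()


citations = [12, float("nan"), None, 3.0]
totals = [safe_int(c) for c in citations]

keywords = ["machine learning", float("nan"), 2021.0]
upper = [safe_upper(k) for k in keywords]
```

Calling `int()` on NaN raises `ValueError`, and calling a string method on a float raises `AttributeError`, so both checks have to run before the conversion rather than after it.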
- NaN handling: added a `pd.isna()` guard in `get_relevantauthors.py`, `get_citedcountries.py`
- `.str` accessor on float columns: added `.astype(str)` in `thematicmap.py`, `get_citeddocuments.py`, `get_historiograph.py`, `couplingmap.py`
6. Empty Matrix Handling
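The guard pattern for this item can be sketched as follows, assuming a square adjacency matrix held as nested lists. The function names `network_edges` and `render` and the message text are hypothetical, and the real modules operate on DataFrames rather than lists.

```python
def network_edges(matrix, labels):
    """Return weighted edges from a square adjacency matrix, or None if empty."""
    if not matrix or not labels:
        return None  # no rows at all: nothing to plot
    edges = [
        (labels[i], labels[j], matrix[i][j])
        for i in range(len(matrix))
        for j in range(i + 1, len(matrix))
        if matrix[i][j]
    ]
    return edges or None  # rows exist but every weight is zero


def render(matrix, labels):
    """Caller side: check for None before touching the result."""
    edges = network_edges(matrix, labels)
    if edges is None:
        return "No network to display for this dataset."
    return f"{len(edges)} edge(s)"


empty_message = render([], [])
one_edge = render([[0, 2], [2, 0]], ["deep learning", "neural networks"])
```

Returning `None` only helps if every caller checks for it, which is why the sketch pairs the producer with a caller that handles the empty case explicitly.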
Added a graceful `return None` instead of a crash when the network matrix is empty:
- `biblionetwork.py`: empty collaboration/co-citation matrices
- `couplingmap.py`: empty coupling networks
- `thematicmap.py`: insufficient data warning
- `get_worldmapcollaboration.py`: missing country affiliations
7. CSS Fix
Fixed `style="width=100%"` → `style="width:100%"` in `app.py` (caused ITable rendering crash).
Dashboard Validation Results
Tested with OpenAlex data (50 documents, query: "machine learning"):
Known Limitations (Data-dependent)
`referenced_works`, not formatted citation strings. Functions requiring standard CR format produce empty results but do not crash.
Contributors