feat: ETL pipeline - convert2df for Scopus, Dimensions, PubMed #21
Open
ALHassanMusleh wants to merge 1 commit into
… + Advanced Level complete
🔬 ETL Pipeline: From Heterogeneous Bibliographic Data to a Unified Schema
ALHASAN M N MUSLEH, D03000272
Course: Data Science 2025/2026 — Prof. Vincenzo Moscato
Project: Bibliometrix-Python ETL — Base Level + Advanced Level
Repository: Fork of PRAISELab-PicusLab/bibliometrix-python
📌 Summary
This Pull Request implements a complete ETL (Extract → Transform → Load) pipeline that makes Bibliometrix-Python fully source-agnostic. Previously, the dashboard only worked correctly with Web of Science (WoS) data. This implementation enables the same analytical capabilities for Scopus, Dimensions, and PubMed data — and adds automatic retrieval from PubMed and OpenAlex APIs without any manual file download.
The single public entry point, convert2df(), mirrors the R-Bibliometrix function of the same name.
✅ Verification Results
Running python verify_project.py confirms full correctness.
🏗️ Architecture
The pipeline is split into four phases, each in its own module, with no monolithic functions.
Dispatcher Pattern (EXTRACT)
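As a rough illustration of the dispatcher used in the EXTRACT phase, the sketch below registers loaders in a dictionary keyed by (source, extension). The loader names, the LOADERS table, and the extract() helper are hypothetical and are not taken from extractor.py; the parsers are deliberately simplified.

```python
import csv
import io
from typing import Callable

Record = dict[str, str]
Loader = Callable[[str], list[Record]]


def load_scopus_csv(text: str) -> list[Record]:
    """Toy loader: parse CSV text into one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_pubmed_txt(text: str) -> list[Record]:
    """Toy loader for MEDLINE-style 'TAG - value' lines.

    Blank lines separate records. Continuation lines are ignored and a
    repeated tag overwrites the earlier value, which a real loader would
    need to handle.
    """
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            if line[4:6] == "- ":
                record[line[:4].strip()] = line[6:].strip()
        if record:
            records.append(record)
    return records


# Hypothetical registry: supporting a new database means writing one
# loader and adding one entry here.
LOADERS: dict[tuple[str, str], Loader] = {
    ("scopus", "csv"): load_scopus_csv,
    ("pubmed", "txt"): load_pubmed_txt,
}


def extract(text: str, source: str, extension: str) -> list[Record]:
    """Look up the loader for (source, extension) and run it."""
    key = (source.lower(), extension.lower().lstrip("."))
    try:
        loader = LOADERS[key]
    except KeyError:
        raise ValueError(f"No loader registered for {key}") from None
    return loader(text)
```

Normalising the key before the lookup keeps callers from having to match case or a leading dot exactly.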
A dictionary maps (source, extension) pairs to the correct loader. Adding a new database takes one function and one dictionary entry.
Mapping Dictionary (TRANSFORM)
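For the TRANSFORM phase, a mapping dictionary of this kind might look like the sketch below. The formatter names follow the format_*_column() naming pattern but are hypothetical, as are SCOPUS_MAPPING and transform(); the Scopus column names "Authors", "Title" and "Year" are the headers of a standard Scopus CSV export, and the actual mapping in transformer.py may differ.

```python
from typing import Any, Callable


def format_authors_column(value: str) -> list[str]:
    """Split a semicolon-separated author string into a list of names."""
    return [name.strip() for name in value.split(";") if name.strip()]


def format_title_column(value: str) -> str:
    """Trim surrounding whitespace from the title."""
    return value.strip()


def format_year_column(value: str) -> int:
    """Convert the year to int, falling back to 0 when it cannot be parsed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


# Hypothetical mapping: WoS tag -> (source column, formatter).
SCOPUS_MAPPING: dict[str, tuple[str, Callable[[str], Any]]] = {
    "AU": ("Authors", format_authors_column),
    "TI": ("Title", format_title_column),
    "PY": ("Year", format_year_column),
}


def transform(row: dict[str, str], mapping: dict) -> dict[str, Any]:
    """Build a WoS-tagged record by applying each formatter to its column."""
    return {
        tag: formatter(row.get(column, ""))
        for tag, (column, formatter) in mapping.items()
    }
```

Because the mapping is plain data, a second source would only need its own dictionary, while transform() stays unchanged.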
Every WoS tag is mapped to its existing format_*_column() function, so all per-source logic is reused, not rewritten.
Type Contracts
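One way contracts like those in this section could be enforced is sketched below. It assumes the value paired with each type is the fallback used when a field is missing or empty. The assignment of tags to types in CONTRACTS, and the helper names, are hypothetical and are not taken from validator.py.

```python
from typing import Any

# Hypothetical contract table: WoS tag -> declared type.
CONTRACTS = {"AU": "list[str]", "PY": "int", "TC": "int", "TI": "str"}


def empty_value(kind: str) -> Any:
    """Return a fresh fallback value for the given declared type."""
    if kind == "list[str]":
        return []
    if kind == "int":
        return 0
    return ""


def coerce(value: Any, kind: str) -> Any:
    """Coerce a raw value to its declared type, or return the fallback."""
    if value is None or value == "" or value == []:
        return empty_value(kind)
    if kind == "list[str]":
        items = value if isinstance(value, (list, tuple)) else [value]
        return [str(item) for item in items]
    if kind == "int":
        try:
            # Going through float accepts strings such as "2021.0".
            return int(float(value))
        except (TypeError, ValueError, OverflowError):
            return 0
    return str(value)


def apply_contracts(record: dict, contracts: dict = CONTRACTS) -> dict:
    """Return a record holding every contracted tag with a well-typed value."""
    return {tag: coerce(record.get(tag), kind) for tag, kind in contracts.items()}
```

Returning a new list from empty_value() on each call avoids sharing one mutable default across records.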
list[str]: []
int: 0
int (numeric): 0
str: ""
📁 New Files Added
standardizer.py (www/services/): convert2df()
extractor.py (www/services/)
transformer.py (www/services/)
field_calculator.py (www/services/)
validator.py (www/services/)
api_retriever.py (www/services/)
verify_project.py
demo_etl.py
🔧 Debugging: Patches Applied to Existing Functions
Patch 1: functions/get_data.py
Problem: The dashboard called biblio_json() on every CSV. ETL output files (WoS tags) crashed with KeyError: 'Abstract' because biblio_json() expects raw Scopus column names.
Fix: Added _is_etl_output(), which detects WoS columns and loads the file directly, bypassing biblio_json().
Patch 2: functions/get_relevantauthors.py
Problem: The UI sends "n_docs" as the frequency key. The code used it as a column name, so Plotly drew zero-width bars.
Fix: Separated the UI key from the display label. The column is always named "N. of Documents".
Patch 3: functions/get_trendtopics.py
Problem 1: time_window (an int) was passed to len(), raising TypeError.
Problem 2: groupby().apply() dropped the year_med column in newer pandas, which crashed Plotly.
Fix: Convert the int to None; use sort_values() + groupby().head().
Patch 4: functions/get_localcitedauthors.py
Problem: histNetwork() returns None for non-WoS databases, raising TypeError: 'NoneType' is not subscriptable. A duplicate import go statement caused UnboundLocalError.
Fix: Added a None guard that renders an empty chart. Removed the duplicate import.
Patch 5: functions/get_countriesproduction.py
Problem: The world map download from naciscdn.org fails with HTTP 403 Forbidden.
Fix: Added a local file cache, a GitHub fallback mirror, and a graceful empty chart if both fail.
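The cache-then-fallback logic of Patch 5 can be sketched roughly as follows. The function names and the injectable fetch parameter are hypothetical and not taken from get_countriesproduction.py; a None result is the signal for the caller to draw the empty chart.

```python
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, Optional


def http_get(url: str, timeout: float = 10.0) -> bytes:
    """Download a URL; urllib raises an OSError subclass on HTTP or network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def load_with_fallback(
    cache_path: Path,
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = http_get,
) -> Optional[bytes]:
    """Return the cached file if present, else try each URL in order.

    The first successful download is written to the cache. None means
    every source failed.
    """
    if cache_path.exists():
        return cache_path.read_bytes()
    for url in urls:
        try:
            data = fetch(url)
        except OSError:
            # Covers urllib.error.HTTPError (e.g. 403), URLError and timeouts.
            continue
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_bytes(data)
        return data
    return None
```

Passing fetch as a parameter lets the fallback order be exercised without network access, for example with a stub that fails for the primary URL.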
Patch 6: www/services/histnetwork.py
Problem: The code checked db == "Scopus" (exact case). The ETL sets DB = "SCOPUS", so all data returned None.
Fix: Case-insensitive check using db.upper(). Accepts SCOPUS, DIMENSIONS, PUBMED.
Patch 7: functions/get_database.py
Problem: The new "1D" API option caused UnboundLocalError: cannot access local variable 'database'.
Fix: Added elif input.select() == "1D", returning "API (PUBMED)" or "API (OPENALEX)".
Patch 8: All 27 chart functions
Problem: All used go.FigureWidget and fig._config, which raise NotImplementedError in Shiny with the installed Plotly version.
Fix: Replaced with go.Figure and removed all fig._config blocks via automated regex patching.
📊 Dashboard Screenshots
🚀 Advanced Level — API Retrieval
PubMed API
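A minimal sketch of how PubMed retrieval could be wired up with the NCBI E-utilities endpoints (esearch to obtain PMIDs, then efetch to download records in MEDLINE text format). Whether api_retriever.py uses these exact parameters is an assumption, and the helper names are hypothetical. The sketch only builds URLs and parses an esearch JSON payload; it performs no network calls.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term: str, retmax: int = 100) -> str:
    """URL that searches PubMed and returns matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def pmids_from_esearch(payload: dict) -> list[str]:
    """Extract the PMID list from a decoded esearch JSON response."""
    return list(payload.get("esearchresult", {}).get("idlist", []))


def build_efetch_url(pmids: list[str]) -> str:
    """URL that downloads the given records in MEDLINE text format."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "medline",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"
```

The MEDLINE text returned by efetch has the same tagged layout as a manual PubMed export, so it could be handed to the same loader as a downloaded file.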
OpenAlex API
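For OpenAlex, a comparable sketch builds a query against the public works endpoint and maps a few fields of each returned work onto WoS tags. The tag assignment (TI, PY, TC, DI) and the helper names are assumptions for illustration and may not match api_retriever.py. The sketch performs no network calls.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_openalex_url(query: str, per_page: int = 25, cursor: str = "*") -> str:
    """URL for a full-text search over works, using cursor paging."""
    params = {"search": query, "per-page": per_page, "cursor": cursor}
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


def works_to_records(payload: dict) -> list[dict]:
    """Map each work in a decoded response onto a small WoS-tagged record."""
    records = []
    for work in payload.get("results", []):
        records.append(
            {
                "TI": work.get("title") or "",
                "PY": work.get("publication_year") or 0,
                "TC": work.get("cited_by_count") or 0,
                "DI": work.get("doi") or "",
            }
        )
    return records
```

The `or` fallbacks replace JSON nulls with typed empty values, so downstream code never sees None for these fields.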
Dashboard Integration (Bonus ✅)
API retrieval is integrated directly into the Shiny dashboard.
📚 References
convert2df(): https://github.com/massimoaria/bibliometrix