feat: ETL pipeline - convert2df for Scopus, Dimensions, PubMed by ALHassanMusleh · Pull Request #21 · PRAISELab-PicusLab/bibliometrix-python

ALHassanMusleh · 2026-06-17T12:15:08Z

🔬 ETL Pipeline: From Heterogeneous Bibliographic Data to a Unified Schema

ALHASAN M N MUSLEH, D03000272

Course: Data Science 2025/2026 — Prof. Vincenzo Moscato
Project: Bibliometrix-Python ETL — Base Level + Advanced Level
Repository: Fork of PRAISELab-PicusLab/bibliometrix-python

📌 Summary

This Pull Request implements a complete ETL (Extract → Transform → Load) pipeline that makes Bibliometrix-Python source-agnostic. Previously, the dashboard worked correctly only with Web of Science (WoS) data. This implementation provides the same analytical capabilities for Scopus, Dimensions, and PubMed data, and adds automatic retrieval from the PubMed and OpenAlex APIs with no manual file download.

The single public entry-point mirrors R-Bibliometrix's convert2df() function:

from www.services.standardizer import convert2df

df = convert2df("sources/Scopus/Scopus.csv",         source="scopus")
df = convert2df("sources/Dimensions/Dimensions.csv", source="dimensions")
df = convert2df("sources/PubMed/pubmed.txt",         source="pubmed")

✅ Verification Results

Running python verify_project.py passes all 52 checks:

============================================================
  BIBLIOMETRIX-PYTHON ETL — PROJECT VERIFICATION
============================================================
▶  Step 1: Module imports              7/7  ✅
▶  Step 2-4: ETL Pipeline              3/3  ✅
▶  Step 5: Mandatory columns present   3/3  ✅
▶  Step 6: Type contracts enforced     9/9  ✅
▶  Step 7: No NaN/None remaining       3/3  ✅
▶  Step 8: SR column populated         3/3  ✅
▶  Step 9: Analytical functions       21/21 ✅
▶  Step 10: Export CSVs               3/3  ✅
RESULTS: 52/52 checks passed 🎉 PROJECT IS FULLY CORRECT AND READY FOR SUBMISSION

| Source | Records | Columns | Status |
|---|---|---|---|
| Scopus CSV | 1,000 | 34 | ✅ All checks passed |
| Dimensions CSV | 500 | 34 | ✅ All checks passed |
| PubMed TXT | 10,000 | 34 | ✅ All checks passed |

🏗️ Architecture

The pipeline is split into four phases, each in its own module, with no monolithic functions:

Raw file (Scopus / Dimensions / PubMed)
         │
         ▼
┌─────────────────────┐
│  Phase 1 — EXTRACT  │  extractor.py      → Dispatcher pattern
└────────┬────────────┘
         ▼
┌──────────────────────┐
│  Phase 2 — TRANSFORM │  transformer.py   → Mapping dictionary + Type contracts
└────────┬─────────────┘
         ▼
┌─────────────────────────┐
│  Phase 3 — CALC FIELDS  │  field_calculator.py → SR key (reuses existing code)
└────────┬────────────────┘
         ▼
┌──────────────────────┐
│  Phase 4 — VALIDATE  │  validator.py     → Column + type + null checks
└──────────────────────┘
         │
         ▼
    Standardised DataFrame ✓
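Phases 3 and 4 have no dedicated section in this description, so the following is only a rough sketch of what they might look like on a single record. The SR format (first author, year, source, upper-cased), the `MANDATORY` column set, and the function names `add_sr` and `validate` are all assumptions made for illustration; the actual modules reuse existing project code and operate on a DataFrame.

```python
# Hypothetical per-record sketch of Phase 3 (SR key) and Phase 4 (validation).
# Column set and SR layout are assumptions, not taken from the PR's code.
MANDATORY = {"AU": list, "TI": str, "SO": str, "PY": int, "TC": int, "SR": str}


def add_sr(record):
    """Return a copy of the record with a derived SR short-reference key."""
    first_author = record["AU"][0] if record["AU"] else ""
    year = str(record["PY"]) if record["PY"] else ""
    parts = [p for p in (first_author, year, record["SO"]) if p]
    out = dict(record)
    out["SR"] = ", ".join(parts).upper()
    return out


def validate(record):
    """Collect column, null, and type problems instead of failing on the first."""
    problems = []
    for tag, expected in MANDATORY.items():
        if tag not in record:
            problems.append(f"missing column {tag}")
        elif record[tag] is None:
            problems.append(f"null value in {tag}")
        elif not isinstance(record[tag], expected):
            problems.append(f"{tag} should be {expected.__name__}")
    return problems
```

Returning a list of problems rather than raising immediately makes it possible to report every failed check for a source in one pass.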

Dispatcher Pattern (EXTRACT)

A dictionary maps (source, extension) pairs to the correct loader. Adding a new database requires one loader function and one dictionary entry:

_DISPATCHER = {
    ('scopus',     '.csv'):  _load_scopus_csv,
    ('dimensions', '.xlsx'): _load_dimensions_xlsx,
    ('pubmed',     '.txt'):  _load_pubmed_txt,
    ('wos',        '.txt'):  _load_wos_txt,
    ...
}
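As a minimal sketch of how such a table could be consulted, the lookup below normalises the source name and file extension before indexing the dictionary. The loader bodies are placeholders and the `extract` function name and error message are assumptions; the real loaders in `extractor.py` return DataFrames.

```python
from pathlib import Path


def _load_scopus_csv(path):
    # Placeholder: a real loader would parse the file into a DataFrame.
    return f"scopus-csv:{path}"


def _load_pubmed_txt(path):
    # Placeholder loader, same caveat as above.
    return f"pubmed-txt:{path}"


_DISPATCHER = {
    ("scopus", ".csv"): _load_scopus_csv,
    ("pubmed", ".txt"): _load_pubmed_txt,
}


def extract(path, source):
    """Pick the loader registered for (source, file extension) and call it."""
    key = (source.lower(), Path(path).suffix.lower())
    try:
        loader = _DISPATCHER[key]
    except KeyError:
        supported = ", ".join(f"{s}{e}" for s, e in sorted(_DISPATCHER))
        raise ValueError(f"Unsupported input {key}; supported: {supported}") from None
    return loader(path)
```

With this shape, an unsupported combination fails early with a message listing the registered pairs, instead of surfacing later as a parsing error.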

Mapping Dictionary (TRANSFORM)

Each WoS tag is mapped to its existing format_*_column() function, so all per-source logic is reused rather than rewritten:

_FIELD_BUILDERS = {
    'AU': format_au_column,   'TI': format_ti_column,
    'SO': format_so_column,   'CR': format_cr_column,
    'TC': format_tc_column,   'PY': format_py_column,
    ...  # all 32 WoS tags
}
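The idea of driving the transform from a tag-to-builder dictionary can be sketched on a single raw record as follows. The three `_stub` builders are hypothetical stand-ins (the signatures of the project's real `format_*_column()` functions are not shown in this description), and the raw keys "Authors", "Title", and "Year" are assumed Scopus-style column names.

```python
def format_au_stub(raw):
    # Hypothetical builder: split a "A; B" author string into a list.
    return [a.strip() for a in raw.get("Authors", "").split(";") if a.strip()]


def format_ti_stub(raw):
    # Hypothetical builder: trimmed title string.
    return raw.get("Title", "").strip()


def format_py_stub(raw):
    # Hypothetical builder: publication year as int, 0 when unparsable.
    year = str(raw.get("Year", "")).strip()
    return int(year) if year.isdigit() else 0


_FIELD_BUILDERS = {
    "AU": format_au_stub,
    "TI": format_ti_stub,
    "PY": format_py_stub,
}


def transform_record(raw):
    """Build every target tag by calling the builder registered for it."""
    return {tag: build(raw) for tag, build in _FIELD_BUILDERS.items()}
```

Because the loop never names a tag explicitly, supporting another tag means adding one dictionary entry rather than editing the transform itself.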

Type Contracts

| Column Group | Required Type | Null Replacement |
|---|---|---|
| AU, AF, C1, CR, DE, ID, EM... | `list[str]` | `[]` |
| TC | `int` | `0` |
| PY | `int` (numeric) | `0` |
| DB, TI, SO, SR, DI, LA, AB... | `str` | `""` |
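A per-value sketch of these contracts is shown below. It is an illustration only: the tag sets are limited to the tags listed in the table, the helper names are hypothetical, and the actual `transformer.py` presumably applies equivalent rules column-wise on a DataFrame.

```python
import math

# Tag groups taken from the table above; the elided tags ("...") are not guessed.
LIST_TAGS = {"AU", "AF", "C1", "CR", "DE", "ID", "EM"}
INT_TAGS = {"TC", "PY"}


def _is_missing(value):
    """True for None and for float NaN, the usual missing-value markers."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def enforce_contract(tag, value):
    """Coerce one cell to the type required for its tag, replacing nulls."""
    if tag in LIST_TAGS:
        if _is_missing(value):
            return []
        if isinstance(value, str):
            return [value] if value else []
        return [str(v) for v in value]
    if tag in INT_TAGS:
        if _is_missing(value):
            return 0
        try:
            return int(float(value))
        except (TypeError, ValueError, OverflowError):
            return 0
    # Every other tag is treated as a plain string column.
    return "" if _is_missing(value) else str(value)
```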

📁 New Files Added

| File | Location | Role |
|---|---|---|
| `standardizer.py` | `www/services/` | Public entry-point `convert2df()` |
| `extractor.py` | `www/services/` | Phase 1: dispatcher + loaders |
| `transformer.py` | `www/services/` | Phase 2: rename + type contracts |
| `field_calculator.py` | `www/services/` | Phase 3: SR derived field |
| `validator.py` | `www/services/` | Phase 4: validation gate |
| `api_retriever.py` | `www/services/` | Advanced Level: PubMed & OpenAlex APIs |
| `verify_project.py` | root | 52-check verification script |
| `demo_etl.py` | root | Step-by-step execution evidence |

🔧 Debugging — Patches Applied to Existing Functions

Patch 1 — `functions/get_data.py`

Problem: Dashboard called biblio_json() on every CSV. ETL output files (WoS tags) crashed with KeyError: 'Abstract' because biblio_json() expects raw Scopus column names.
Fix: Added _is_etl_output() — detects WoS columns and loads directly, bypassing biblio_json().

Patch 2 — `functions/get_relevantauthors.py`

Problem: UI sends "n_docs" as frequency key. Code used it as column name → plotly drew zero-width bars.
Fix: Separated UI key from display label. Column always named "N. of Documents".
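The separation described in this fix can be illustrated roughly as follows: the UI key is only validated, while the output column name is a fixed constant. The function name, the "Author" column, and the set of accepted keys are hypothetical; only "n_docs" and "N. of Documents" come from the patch description.

```python
from collections import Counter

DISPLAY_COLUMN = "N. of Documents"   # label used for the data column and the chart
KNOWN_UI_KEYS = {"n_docs"}           # assumption: other UI keys are not shown in the PR


def relevant_authors(au_column, ui_key="n_docs", top=10):
    """Count documents per author; the UI key never becomes a column name."""
    if ui_key not in KNOWN_UI_KEYS:
        raise ValueError(f"Unknown frequency key: {ui_key!r}")
    counts = Counter(
        author
        for authors in au_column
        for author in dict.fromkeys(authors)  # de-duplicate within one document
    )
    return [{"Author": a, DISPLAY_COLUMN: n} for a, n in counts.most_common(top)]
```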

Patch 3 — `functions/get_trendtopics.py`

Problem 1: time_window (int) passed to len() → TypeError.
Problem 2: groupby().apply() dropped year_med column in newer pandas → plotly crash.
Fix: Convert int to None; use sort_values() + groupby().head().
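The `sort_values()` plus `groupby().head()` pattern keeps the grouping column in the result, which is what the second problem needed. The sketch below assumes a frame with `term`, `year_med`, and `freq` columns; apart from `year_med`, those names and both helper functions are hypothetical.

```python
import pandas as pd


def normalise_time_window(time_window):
    """An int has no len(); per the fix, treat it as 'no explicit window'."""
    return None if isinstance(time_window, int) else time_window


def top_terms_per_year(df, n=2):
    """Keep the n most frequent terms for each median year.

    Sorting first and then taking head(n) per group returns ordinary rows,
    so the year_med column is still present in the output.
    """
    ordered = df.sort_values(["year_med", "freq"], ascending=[True, False])
    return ordered.groupby("year_med").head(n).reset_index(drop=True)
```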

Patch 4 — `functions/get_localcitedauthors.py`

Problem: histNetwork() returns None for non-WoS databases → TypeError: 'NoneType' is not subscriptable. Duplicate import go caused UnboundLocalError.
Fix: Added None guard with graceful empty chart. Removed duplicate import.
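Both failure modes can be reproduced in isolation. The sketch below uses the standard-library `json` module in place of the Plotly `go` alias to show why a second import inside a function triggers `UnboundLocalError`, and a hypothetical `rows_or_empty` helper (including its "authors" key) to show the shape of a `None` guard.

```python
import json


def shadowed():
    # The import further down makes `json` a local name for the whole function,
    # so this earlier use raises UnboundLocalError at call time.
    text = json.dumps({"ok": True})
    import json  # duplicate import: the cause of the error
    return text


def rows_or_empty(hist_result):
    """Guard against a histNetwork-style call that returned None.

    Returns (rows, message); an empty list lets the caller draw an empty chart.
    The "authors" key is a placeholder for whatever the real result contains.
    """
    if hist_result is None:
        return [], "Local citation data is not available for this database."
    return hist_result["authors"], ""
```

Removing the inner import restores normal lookup of the module-level name, which matches the fix described above.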

Patch 5 — `functions/get_countriesproduction.py`

Problem: Downloads world map from naciscdn.org → HTTP 403 Forbidden.
Fix: Added local file cache + GitHub fallback mirror + graceful empty chart if both fail.
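A cache-then-fallback loader of this kind might look like the sketch below. The download step is passed in as a list of callables so the example stays free of real URLs; `load_world_map` and its parameters are hypothetical names, and returning `None` stands in for the "graceful empty chart" case.

```python
from pathlib import Path


def load_world_map(cache_path, fetchers):
    """Return map bytes from the local cache, else from the first working source.

    `fetchers` is an ordered list of zero-argument callables (primary source,
    then mirrors). urllib's URLError and HTTPError are OSError subclasses, so a
    403 from a urllib-based fetcher falls through to the next one.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        return cache_path.read_bytes()
    for fetch in fetchers:
        try:
            data = fetch()
        except OSError:
            continue  # try the next mirror
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_bytes(data)
        return data
    return None  # caller renders an empty chart
```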

Patch 6 — `www/services/histnetwork.py`

Problem: Checked db == "Scopus" (exact case). ETL sets DB = "SCOPUS" → all data returned None.
Fix: Case-insensitive check using db.upper(). Accepts SCOPUS, DIMENSIONS, PUBMED.
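A case-insensitive membership test along these lines is enough to accept both "Scopus" and "SCOPUS". The helper name `is_supported_db` is hypothetical; the three accepted values are the ones listed in the fix.

```python
_SUPPORTED_DBS = {"SCOPUS", "DIMENSIONS", "PUBMED"}


def is_supported_db(db):
    """Compare the DB label in upper case so 'Scopus' and 'SCOPUS' both match."""
    return isinstance(db, str) and db.strip().upper() in _SUPPORTED_DBS
```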

Patch 7 — `functions/get_database.py`

Problem: New "1D" API option caused UnboundLocalError: cannot access local variable 'database'.
Fix: Added elif input.select() == "1D" returning "API (PUBMED)" or "API (OPENALEX)".

Patch 8 — All 27 chart functions

Problem: All used go.FigureWidget and fig._config → NotImplementedError in Shiny with the installed Plotly version.
Fix: Replaced with go.Figure and removed all fig._config blocks via automated regex patching.
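The automated patching could be approximated with two regular expressions, as sketched below. This version assumes each `fig._config` statement fits on a single line; multi-line blocks would need a more careful approach, and the patterns actually used in the PR are not shown in this description.

```python
import re

# Swap the widget class for the plain figure class.
_WIDGET = re.compile(r"\bgo\.FigureWidget\b")
# Drop whole lines that touch fig._config (single-line statements only).
_CONFIG_LINE = re.compile(r"^[ \t]*fig\._config\b.*\n?", re.MULTILINE)


def patch_chart_source(source):
    """Return the source text with FigureWidget replaced and _config lines removed."""
    source = _WIDGET.sub("go.Figure", source)
    return _CONFIG_LINE.sub("", source)
```

Running the function over each chart module's text and writing the result back would apply the change uniformly; reviewing the resulting diff before committing is still advisable.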

📊 Dashboard Screenshots

🚀 Advanced Level — API Retrieval

PubMed API

Query: 'machine learning' → 100 records retrieved
[Validator] ✓ 100 records passed all checks.

   TI                                        AU                 PY    SO
0  Machine-learning-assisted artificial ...  [Xiong Y, Xiao J]  2026  Biosensors & bioelectronics
1  Authoritative Textbook-Augmented LLM ...  [He K, Xiao Q]     2026  J Med Internet Research
2  Predicting Frailty Trajectories ...       [Huang J, Fan Q]   2026  JMIR Aging

OpenAlex API

Query: 'bibliometrics' → 50 records retrieved
[Validator] ✓ 50 records passed all checks.

   TI                                  AU                                PY
0  Software survey: VOSviewer ...      [Nees Jan van Eck, Ludo Waltman]  2009
1  How to conduct a bibliometric ...   [Naveen Donthu, Satish Kumar]     2021
2  Bibliometric Methods in Management  [Ivan Župič, Tomaž Čater]         2014
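For orientation, the two queries above correspond to request URLs of roughly the following form. This sketch only builds the URLs and makes no network call; the helper names are hypothetical and `api_retriever.py` may construct its requests differently. Note that PubMed's `esearch` endpoint returns record IDs, so a second call (for example to `efetch`) is needed to obtain the records themselves.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def pubmed_search_url(query, max_results=100):
    """URL for an E-utilities esearch request returning up to max_results PMIDs."""
    params = {"db": "pubmed", "term": query, "retmax": max_results, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def openalex_search_url(query, max_results=50):
    """URL for an OpenAlex works search; per-page is capped by the API."""
    params = {"search": query, "per-page": max_results}
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"
```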

Dashboard Integration (Bonus ✅)

API retrieval is integrated directly into the Shiny dashboard:

1. Data → Import or Load
2. Select "Retrieve via API (PubMed / OpenAlex)"
3. Enter search query + platform + max results
4. Click "Retrieve & Load"
5. All analysis panels activate immediately, with no file download needed

📚 References

R-Bibliometrix convert2df(): https://github.com/massimoaria/bibliometrix
PubMed E-utilities API: https://www.ncbi.nlm.nih.gov/books/NBK25500/
OpenAlex API: https://docs.openalex.org/

feat: ETL pipeline - convert2df for Scopus, Dimensions, PubMed - Base…

e24ca0e

… + Advanced Level complete

ALHassanMusleh wants to merge 1 commit into PRAISELab-PicusLab:main from ALHassanMusleh:main
