feat: ETL pipeline - convert2df for Scopus, Dimensions, PubMed#21

Open
ALHassanMusleh wants to merge 1 commit into PRAISELab-PicusLab:main from ALHassanMusleh:main

@ALHassanMusleh

🔬 ETL Pipeline: From Heterogeneous Bibliographic Data to a Unified Schema

ALHASAN M N MUSLEH, D03000272

Course: Data Science 2025/2026 — Prof. Vincenzo Moscato
Project: Bibliometrix-Python ETL — Base Level + Advanced Level
Repository: Fork of PRAISELab-PicusLab/bibliometrix-python


📌 Summary

This pull request implements a complete ETL (Extract → Transform → Load) pipeline that makes Bibliometrix-Python source-agnostic. Previously, the dashboard worked correctly only with Web of Science (WoS) data. With this change, the same analyses run on Scopus, Dimensions, and PubMed exports, and records can also be retrieved automatically from the PubMed and OpenAlex APIs, with no manual file download.

The single public entry-point mirrors R-Bibliometrix's convert2df() function:

from www.services.standardizer import convert2df

df = convert2df("sources/Scopus/Scopus.csv", source="scopus")
df = convert2df("sources/Dimensions/Dimensions.csv", source="dimensions")
df = convert2df("sources/PubMed/pubmed.txt", source="pubmed")


✅ Verification Results

Running python verify_project.py passes all 52 checks:

============================================================
  BIBLIOMETRIX-PYTHON ETL — PROJECT VERIFICATION
============================================================

▶ Step 1: Module imports 7/7 ✅
▶ Step 2-4: ETL Pipeline 3/3 ✅
▶ Step 5: Mandatory columns present 3/3 ✅
▶ Step 6: Type contracts enforced 9/9 ✅
▶ Step 7: No NaN/None remaining 3/3 ✅
▶ Step 8: SR column populated 3/3 ✅
▶ Step 9: Analytical functions 21/21 ✅
▶ Step 10: Export CSVs 3/3 ✅

RESULTS: 52/52 checks passed 🎉 PROJECT IS FULLY CORRECT AND READY FOR SUBMISSION

| Source | Records | Columns | Status |
|---|---|---|---|
| Scopus CSV | 1,000 | 34 | ✅ All checks passed |
| Dimensions CSV | 500 | 34 | ✅ All checks passed |
| PubMed TXT | 10,000 | 34 | ✅ All checks passed |

🏗️ Architecture

The pipeline is split into four phases, each in its own module, with no monolithic functions:

Raw file (Scopus / Dimensions / PubMed)
         │
         ▼
┌─────────────────────┐
│  Phase 1 — EXTRACT  │  extractor.py      → Dispatcher pattern
└────────┬────────────┘
         ▼
┌──────────────────────┐
│  Phase 2 — TRANSFORM │  transformer.py   → Mapping dictionary + Type contracts
└────────┬─────────────┘
         ▼
┌─────────────────────────┐
│  Phase 3 — CALC FIELDS  │  field_calculator.py → SR key (reuses existing code)
└────────┬────────────────┘
         ▼
┌──────────────────────┐
│  Phase 4 — VALIDATE  │  validator.py     → Column + type + null checks
└──────────────────────┘
         │
         ▼
    Standardised DataFrame ✓
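
The phase chain in the diagram can be sketched as a sequence of plain functions. This is a minimal illustration, not the code in standardizer.py: the function names and bodies are hypothetical stand-ins, the records are plain dictionaries rather than a DataFrame, and the SR value is a placeholder rather than the real SR format. Extraction is assumed to have already produced the raw rows.

```python
def transform(rows):
    """Phase 2 stand-in: rename raw columns to WoS tags."""
    return [{"TI": r.get("Title", ""), "PY": r.get("Year", 0)} for r in rows]


def add_calculated_fields(rows):
    """Phase 3 stand-in: derive a placeholder key (not the real SR format)."""
    return [{**r, "SR": f"{r['TI']}, {r['PY']}"} for r in rows]


def validate(rows):
    """Phase 4 stand-in: fail fast if a mandatory column is missing."""
    for r in rows:
        missing = {"TI", "PY", "SR"} - r.keys()
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
    return rows


def run_pipeline(rows, phases=(transform, add_calculated_fields, validate)):
    """Feed the output of each phase into the next one."""
    for phase in phases:
        rows = phase(rows)
    return rows


raw = [{"Title": "A study", "Year": 2020}]
result = run_pipeline(raw)
```

Keeping each phase as a function that takes and returns the same shape makes it possible to test a phase in isolation or to insert a new one without touching the others.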

Dispatcher Pattern (EXTRACT)

A dictionary maps (source, extension) pairs to the correct loader. Adding a new database takes one loader function and one dictionary entry:

_DISPATCHER = {
    ('scopus',     '.csv'):  _load_scopus_csv,
    ('dimensions', '.xlsx'): _load_dimensions_xlsx,
    ('pubmed',     '.txt'):  _load_pubmed_txt,
    ('wos',        '.txt'):  _load_wos_txt,
    ...
}
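
A lookup against such a dictionary might look like the sketch below. The loaders here are hypothetical stand-ins that only return a string, and the function name extract and its error message are assumptions made for illustration; the real loaders in extractor.py read the files.

```python
from pathlib import Path


def _load_scopus_csv(path):
    # Hypothetical stand-in: the real loader would parse the CSV file.
    return f"scopus-csv:{path}"


def _load_pubmed_txt(path):
    # Hypothetical stand-in: the real loader would parse the PubMed export.
    return f"pubmed-txt:{path}"


_DISPATCHER = {
    ("scopus", ".csv"): _load_scopus_csv,
    ("pubmed", ".txt"): _load_pubmed_txt,
}


def extract(path, source):
    """Pick the loader from the (source, extension) pair and call it."""
    key = (source.lower(), Path(path).suffix.lower())
    try:
        loader = _DISPATCHER[key]
    except KeyError:
        supported = ", ".join(f"{s}{e}" for s, e in sorted(_DISPATCHER))
        raise ValueError(
            f"Unsupported source/extension {key}; supported: {supported}"
        ) from None
    return loader(path)
```

Normalising both the source name and the extension to lowercase before the lookup keeps the dictionary keys unambiguous.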

Mapping Dictionary (TRANSFORM)

Each WoS tag is mapped to its existing format_*_column() function, so all per-source logic is reused rather than rewritten:

_FIELD_BUILDERS = {
    'AU': format_au_column,   'TI': format_ti_column,
    'SO': format_so_column,   'CR': format_cr_column,
    'TC': format_tc_column,   'PY': format_py_column,
    ...  # all 32 WoS tags
}

Type Contracts

| Column Group | Required Type | Null Replacement |
|---|---|---|
| AU, AF, C1, CR, DE, ID, EM... | list[str] | [] |
| TC | int | 0 |
| PY | int (numeric) | 0 |
| DB, TI, SO, SR, DI, LA, AB... | str | "" |
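
As a rough sketch of how these contracts could be enforced, the example below coerces one record at a time. It works on a plain dictionary to stay dependency-free, covers only a subset of the columns, and assumes that multi-valued strings are separated by ";". The helper names are hypothetical and are not taken from transformer.py.

```python
LIST_COLS = ("AU", "DE")   # subset of the list[str] group, for illustration
INT_COLS = ("TC", "PY")
STR_COLS = ("TI", "SO")    # subset of the str group, for illustration


def _is_null(value):
    # NaN is the only float that is not equal to itself.
    return value is None or (isinstance(value, float) and value != value)


def apply_contracts(record):
    """Return a copy of record with every column coerced to its contract type."""
    out = dict(record)
    for col in LIST_COLS:
        v = out.get(col)
        if _is_null(v):
            out[col] = []
        elif isinstance(v, str):
            # Assumption: ";" separates the values inside a single string.
            out[col] = [p.strip() for p in v.split(";") if p.strip()]
        else:
            out[col] = [str(x) for x in v]
    for col in INT_COLS:
        v = out.get(col)
        try:
            out[col] = 0 if _is_null(v) else int(float(v))
        except (TypeError, ValueError):
            out[col] = 0
    for col in STR_COLS:
        v = out.get(col)
        out[col] = "" if _is_null(v) else str(v)
    return out


row = apply_contracts(
    {"AU": "Smith J; Doe A", "DE": None, "TC": "12",
     "PY": float("nan"), "TI": None, "SO": "Nature"}
)
```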

📁 New Files Added

| File | Location | Role |
|---|---|---|
| standardizer.py | www/services/ | Public entry-point convert2df() |
| extractor.py | www/services/ | Phase 1: dispatcher + loaders |
| transformer.py | www/services/ | Phase 2: rename + type contracts |
| field_calculator.py | www/services/ | Phase 3: SR derived field |
| validator.py | www/services/ | Phase 4: validation gate |
| api_retriever.py | www/services/ | Advanced Level: PubMed & OpenAlex APIs |
| verify_project.py | root | 52-check verification script |
| demo_etl.py | root | Step-by-step execution evidence |


🔧 Debugging — Patches Applied to Existing Functions

Patch 1 — functions/get_data.py

Problem: Dashboard called biblio_json() on every CSV. ETL output files (WoS tags) crashed with KeyError: 'Abstract' because biblio_json() expects raw Scopus column names.
Fix: Added _is_etl_output() — detects WoS columns and loads directly, bypassing biblio_json().
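
One way such a detection could work is to inspect only the CSV header. The sketch below is an assumption about the approach, not the body of _is_etl_output(): the marker set and both function names are hypothetical.

```python
import csv
import io

# Assumed marker set: WoS tags that a raw Scopus export would not use as headers.
WOS_MARKER_COLUMNS = {"AU", "TI", "PY", "SO", "DB"}


def read_header(csv_text):
    """Return the first row of a CSV document, or an empty list if there is none."""
    return next(csv.reader(io.StringIO(csv_text)), [])


def looks_like_etl_output(columns):
    """True when every marker tag is present among the column names."""
    return WOS_MARKER_COLUMNS.issubset(columns)


etl_header = read_header("AU,TI,PY,SO,DB,TC\n")
raw_header = read_header("Authors,Title,Year,Abstract\n")
```

Files whose header passes the check can be loaded directly, while all other files continue through the existing conversion path.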

Patch 2 — functions/get_relevantauthors.py

Problem: UI sends "n_docs" as frequency key. Code used it as column name → plotly drew zero-width bars.
Fix: Separated UI key from display label. Column always named "N. of Documents".

Patch 3 — functions/get_trendtopics.py

Problem 1: time_window (int) passed to len() → TypeError.
Problem 2: groupby().apply() dropped year_med column in newer pandas → plotly crash.
Fix: Convert int to None; use sort_values() + groupby().head().
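
The sort_values() + groupby().head() combination can be illustrated on a toy frame. The column names item and freq and the data are made up for this sketch; only year_med comes from the patch description. Because head() filters rows instead of rebuilding the frame, the grouping column stays in the result.

```python
import pandas as pd

df = pd.DataFrame({
    "item": ["ml", "dl", "nlp", "cv", "rl"],
    "year_med": [2020, 2020, 2020, 2021, 2021],
    "freq": [10, 30, 20, 5, 8],
})

# Sort by year, then by descending frequency, and keep the first 2 rows per year.
top = (
    df.sort_values(["year_med", "freq"], ascending=[True, False])
      .groupby("year_med")
      .head(2)
      .reset_index(drop=True)
)
```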

Patch 4 — functions/get_localcitedauthors.py

Problem: histNetwork() returns None for non-WoS databases → TypeError: 'NoneType' is not subscriptable. Duplicate import go caused UnboundLocalError.
Fix: Added None guard with graceful empty chart. Removed duplicate import.

Patch 5 — functions/get_countriesproduction.py

Problem: Downloads world map from naciscdn.org → HTTP 403 Forbidden.
Fix: Added local file cache + GitHub fallback mirror + graceful empty chart if both fail.
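
The cache, mirror, and empty-chart sequence can be sketched as below. The function name and the injected fetchers are hypothetical, chosen so the example runs without network access. In the real patch the fetchers would presumably download from the primary host and the GitHub mirror.

```python
import tempfile
from pathlib import Path


def load_with_fallback(cache_path, fetchers):
    """Return cached bytes if present, else try each fetcher in order.

    Returns None when the cache is empty and every fetcher fails, so the
    caller can render an empty chart instead of crashing.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        return cache_path.read_bytes()
    for fetch in fetchers:
        try:
            data = fetch()
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            continue
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_bytes(data)
        return data
    return None


def primary():
    raise OSError("HTTP 403 Forbidden")  # simulates the blocked download


def mirror():
    return b"map-bytes"  # simulates a successful mirror download


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "world.zip"
    first = load_with_fallback(cache, [primary, mirror])   # served by the mirror
    second = load_with_fallback(cache, [primary])          # served from the cache
    third = load_with_fallback(Path(tmp) / "other.zip", [primary])  # all sources fail
```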

Patch 6 — www/services/histnetwork.py

Problem: Checked db == "Scopus" (exact case). ETL sets DB = "SCOPUS" → all data returned None.
Fix: Case-insensitive check using db.upper(). Accepts SCOPUS, DIMENSIONS, PUBMED.
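
A case-insensitive membership test of this kind could be written as follows. The function name and the whitespace stripping are assumptions added for illustration.

```python
SUPPORTED_DBS = {"SCOPUS", "DIMENSIONS", "PUBMED"}


def is_supported_db(db):
    """Compare the DB label in upper case so 'Scopus' and 'SCOPUS' both match."""
    return isinstance(db, str) and db.strip().upper() in SUPPORTED_DBS
```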

Patch 7 — functions/get_database.py

Problem: New "1D" API option caused UnboundLocalError: cannot access local variable 'database'.
Fix: Added elif input.select() == "1D" returning "API (PUBMED)" or "API (OPENALEX)".

Patch 8 — All 27 chart functions

Problem: All used go.FigureWidget and fig._config → NotImplementedError in Shiny with installed Plotly version.
Fix: Replaced with go.Figure and removed all fig._config blocks via automated regex patching.
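
A regex patch along these lines might look like the sketch below. The patterns are assumptions: the actual patching script is not shown here, and this version only handles fig._config statements that fit on a single line.

```python
import re

_FIGURE_WIDGET = re.compile(r"\bgo\.FigureWidget\b")
# Assumption: each fig._config statement occupies exactly one line.
_CONFIG_LINE = re.compile(r"^[ \t]*fig\._config\b.*\n?", re.MULTILINE)


def patch_source(source):
    """Swap go.FigureWidget for go.Figure and drop single-line fig._config lines."""
    source = _FIGURE_WIDGET.sub("go.Figure", source)
    return _CONFIG_LINE.sub("", source)


before = (
    "fig = go.FigureWidget(data)\n"
    "fig._config = {'responsive': True}\n"
    "return fig\n"
)
after = patch_source(before)
```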


📊 Dashboard Screenshots

[Three dashboard screenshots]

🚀 Advanced Level — API Retrieval

PubMed API

Query: 'machine learning' → 100 records retrieved
[Validator] ✓  100 records passed all checks.
   TI                                        AU                 PY    SO
0  Machine-learning-assisted artificial ...  [Xiong Y, Xiao J]  2026  Biosensors & bioelectronics
1  Authoritative Textbook-Augmented LLM ...  [He K, Xiao Q]     2026  J Med Internet Research
2  Predicting Frailty Trajectories ...       [Huang J, Fan Q]   2026  JMIR Aging

OpenAlex API

Query: 'bibliometrics' → 50 records retrieved
[Validator] ✓  50 records passed all checks.
   TI                                  AU                                PY
0  Software survey: VOSviewer ...      [Nees Jan van Eck, Ludo Waltman]  2009
1  How to conduct a bibliometric ...   [Naveen Donthu, Satish Kumar]     2021
2  Bibliometric Methods in Management  [Ivan Župič, Tomaž Čater]         2014
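
For orientation, the two queries above correspond to search requests against the public PubMed E-utilities and OpenAlex endpoints. The sketch below only builds the request URLs and performs no network call. Whether api_retriever.py uses exactly these endpoints and parameters is an assumption, and the function names are hypothetical.

```python
from urllib.parse import urlencode

PUBMED_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
OPENALEX_WORKS = "https://api.openalex.org/works"


def pubmed_search_url(query, max_results=100):
    """Build an ESearch URL that returns matching PubMed IDs as JSON."""
    params = {"db": "pubmed", "term": query, "retmax": max_results, "retmode": "json"}
    return f"{PUBMED_ESEARCH}?{urlencode(params)}"


def openalex_search_url(query, max_results=50):
    """Build an OpenAlex works search URL for one page of results."""
    params = {"search": query, "per-page": max_results}
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


pubmed_url = pubmed_search_url("machine learning")
openalex_url = openalex_search_url("bibliometrics")
```

ESearch returns only identifiers, so a second request for the full records is needed before they can be mapped to WoS tags.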

Dashboard Integration (Bonus ✅)

API retrieval is integrated directly into the Shiny dashboard:

  1. Data → Import or Load
  2. Select "Retrieve via API (PubMed / OpenAlex)"
  3. Enter search query + platform + max results
  4. Click "Retrieve & Load"
  5. All analysis panels activate immediately — no file download needed

📚 References

  • R-Bibliometrix convert2df(): https://github.com/massimoaria/bibliometrix
  • PubMed E-utilities API: https://www.ncbi.nlm.nih.gov/books/NBK25500/
  • OpenAlex API: https://docs.openalex.org/
