feat: Source-agnostic ETL pipeline — Scopus, Dimensions, Lens, Cochrane, PubMed API, OpenAlex API#20
Open
qmmrjaved-hue wants to merge 1 commit into
Authors
Course: Data Science — AY 2025/2026
Professor: Prof. Vincenzo Moscato
University: Università degli Studi di Napoli Federico II (UNINA)
Bibliometrix Python — Source-Agnostic ETL Pipeline
Overview
This pull request implements a robust ETL (Extract → Transform → Validate → Load) pipeline
for the Bibliometrix Python port, making it fully source-agnostic. The current implementation
only works reliably with Web of Science data. This contribution completes all five sources
marked "in progress" in the repository and adds OpenAlex API support, which did not exist before.
All 10 supported sources produce a standardized DataFrame in the WoS Field Tag schema
(DB, UT, TI, AU, AF, PY, SO, JI, TC, CR, DE, ID, AB, C1, RP, DI, PMID, DT, LA, VL, IS, BP, EP, SR), so every existing
analytical function in www/services/ and www/functions/ runs without modification.
What Was Added
New Modules (www/services/)
- mapping_dicts.py: MANDATORY_COLUMNS, LIST_FIELDS, SCALAR_FIELDS, SOURCE_TO_DB
- standardizer.py: load_file(), detect_source(), rename_columns(), enforce_types(), handle_nulls(), add_calculated_fields(), run_pipeline(), export_to_csv()
- validator.py: validate(df) and ValidationError
- api_retriever.py: fetch_pubmed(), fetch_openalex() with pagination and exponential-backoff retry
New Dashboard (dashboard/)
- dashboard/app.py
New Tests & Config
- tests/test_etl.py
- pytest.ini: integration, file_sources
Source Coverage
API-Automated (no manual file download required)
- PubMed: esearch + efetch (MEDLINE format)
- OpenAlex: https://api.openalex.org/works with cursor pagination
Both sources handle pagination, rate limiting, and exponential-backoff retries automatically.
The user provides only a text query and a result count.
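The combination of cursor pagination and exponential-backoff retry described above can be sketched as follows. This is a minimal illustration, not the code in api_retriever.py: the names `with_backoff`, `fetch_all`, and `fake_page` are hypothetical, and `fake_page` is an offline stand-in for an HTTP request so the example runs without network access. The initial cursor value `"*"` follows the OpenAlex convention; the retry count and delays are assumptions.

```python
import time
from typing import Callable, Optional

Page = tuple[list[dict], Optional[str]]  # (records, next_cursor)


def with_backoff(call: Callable[[], Page], retries: int = 4,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> Page:
    """Retry `call` on ConnectionError, doubling the wait after each failure."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return call()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    raise RuntimeError("unreachable")


def fetch_all(get_page: Callable[[Optional[str]], Page], max_results: int,
              sleep: Callable[[float], None] = time.sleep) -> list[dict]:
    """Follow cursors until `max_results` records are collected or pages run out."""
    records: list[dict] = []
    cursor: Optional[str] = "*"  # initial cursor
    while cursor is not None and len(records) < max_results:
        page, cursor = with_backoff(lambda c=cursor: get_page(c), sleep=sleep)
        if not page:
            break
        records.extend(page)
    return records[:max_results]


# Offline stand-in for the HTTP layer: three pages, and the first call fails once.
_PAGES = {
    "*": ([{"id": 1}, {"id": 2}], "c2"),
    "c2": ([{"id": 3}, {"id": 4}], "c3"),
    "c3": ([{"id": 5}], None),
}
_failures = {"left": 1}


def fake_page(cursor: Optional[str]) -> Page:
    if _failures["left"] > 0:
        _failures["left"] -= 1
        raise ConnectionError("simulated transient failure")
    return _PAGES[cursor]


waits: list[float] = []  # record the requested delays instead of sleeping
result = fetch_all(fake_page, max_results=3, sleep=waits.append)
```

Injecting `sleep` keeps the retry logic testable without real delays; in production use the default `time.sleep` would apply.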
File-Based (user uploads an exported file)
Smoke-tested record counts (real sample files from sources/)
SCOPUS, SCOPUS, WOS, WOS, DIMENSIONS, LENS, COCHRANE, PUBMED
Architecture
1. Mapping Dictionaries: single source of truth (mapping_dicts.py)
Every source has its own dictionary mapping source-native column names to WoS Field Tags.
Column names are never hardcoded anywhere else in the codebase.
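As a rough illustration of the idea, a per-source dictionary can drive the renaming step. The entries below are assumptions made for the example (only the column names "EID", "Authors", "Source title" and the `SCOPUS_CSV` → `"SCOPUS"` mapping appear in this description); `SCOPUS_CSV_MAP`, `MAPPINGS`, and `rename_record` are hypothetical names, and the sample row is invented.

```python
# Illustrative mapping only: the real dictionaries live in mapping_dicts.py.
SCOPUS_CSV_MAP = {
    "Authors": "AU",
    "Title": "TI",
    "Year": "PY",
    "Source title": "SO",
    "Cited by": "TC",
    "EID": "UT",
}
SOURCE_TO_DB = {"SCOPUS_CSV": "SCOPUS"}
MAPPINGS = {"SCOPUS_CSV": SCOPUS_CSV_MAP}


def rename_record(record: dict, source: str) -> dict:
    """Rename source-native keys to WoS Field Tags; unmapped keys are dropped."""
    mapping = MAPPINGS[source]
    renamed = {mapping[key]: value for key, value in record.items() if key in mapping}
    renamed["DB"] = SOURCE_TO_DB[source]  # tag the record with its database
    return renamed


# Invented sample row, shaped like a Scopus CSV export.
row = {
    "Authors": "Rossi M.; Bianchi L.",
    "Title": "A study",
    "Year": "2021",
    "Source title": "Journal X",
    "Cited by": "7",
    "EID": "2-s2.0-0000000000",
    "Link": "not mapped, so it is dropped",
}
renamed_row = rename_record(row, "SCOPUS_CSV")
```

With a pandas DataFrame, the same dictionary could be passed to `DataFrame.rename(columns=...)`, which keeps the mapping as the only place where source column names appear.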
2. Auto-detection (standardizer.py → detect_source())
detect_source() identifies the source from:
- the DB column value (set by load_file() or the API retrievers)
- a set of column names (e.g. {"EID", "Authors", "Source title"} → SCOPUS_CSV)
load_file() auto-detects the format from the file extension and content sampling (first 800 chars),
or accepts an explicit source= override.
3. Type Contracts (standardizer.py → enforce_types())
Every output field has a strictly enforced type, and no NaN or None values remain:
- AU, AF, C1, CR, DE, ID: list[str], null default []
- TI, SO, AB, DI, UT, DT, LA, RP, JI, VL, IS, BP, EP, PMID: str, null default ""
- PY: str (4-digit year extracted from full date string), null default ""
- TC: int, null default 0
- DB: str (set from SOURCE_TO_DB, e.g. SCOPUS_CSV → "SCOPUS")
- SR: str (calculated short reference)
Source-specific pre-processing is contained in enforce_types():
- "Author A and Author B" → split on " and " → list[str]
- note field via regex ("Cited by: N; ...")
- url field via regex (?eid=2-s2.0-...)
- "Name (Institution)"
- IS (ISSN) remapped to SN; IP (issue) remapped to IS
- ID field is the record identifier, not index keywords; it is renamed before list splitting
4. Validation (validator.py)
validate(df) checks three contracts and returns a structured report dict:
- every mandatory column is present
- no NaN or None remains in any column
- list fields (AU, AF, C1, CR, DE, ID) are list[str]
On failure it raises ValidationError("<column_name>: <reason>").
The report dict is consumed by the Streamlit dashboard's Validation tab.
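The detection, type-enforcement, and validation steps above can be sketched on plain dictionaries. This is a simplified, dependency-free illustration and not the code in standardizer.py or validator.py, which operate on DataFrames. Assumptions made for the example: the `FINGERPRINTS` table (only the Scopus entry comes from this description), the `";"` separator for list fields, the reduced `MANDATORY` set, and the keys of the report dict.

```python
import re

LIST_FIELDS = ("AU", "AF", "C1", "CR", "DE", "ID")
MANDATORY = LIST_FIELDS + ("TI", "PY", "TC", "DB")  # assumption: a subset of the schema
FINGERPRINTS = {"SCOPUS_CSV": {"EID", "Authors", "Source title"}}


class ValidationError(Exception):
    """Raised with a "<column_name>: <reason>" message."""


def detect_source(columns) -> str:
    """Return the first source whose fingerprint columns are all present."""
    present = set(columns)
    for source, required in FINGERPRINTS.items():
        if required <= present:
            return source
    raise ValueError("unrecognised source")


def enforce_types(record: dict) -> dict:
    """Coerce one record to the type contract: list[str], 4-digit PY, int TC."""
    out = dict(record)
    for field in LIST_FIELDS:
        value = out.get(field)
        if value is None or value == "":
            out[field] = []
        elif isinstance(value, str):
            out[field] = [part.strip() for part in value.split(";") if part.strip()]
        else:
            out[field] = [str(item) for item in value]
    year = re.search(r"\d{4}", str(out.get("PY") or ""))
    out["PY"] = year.group(0) if year else ""
    try:
        out["TC"] = int(out.get("TC") or 0)
    except (TypeError, ValueError):
        out["TC"] = 0  # unparseable or NaN citation count falls back to the default
    for field in ("TI", "DB"):
        out[field] = str(out.get(field) or "")
    return out


def validate(records: list) -> dict:
    """Check the three contracts; raise ValidationError on the first violation."""
    for record in records:
        for column in MANDATORY:
            if column not in record:
                raise ValidationError(f"{column}: missing column")
            value = record[column]
            if value is None or value != value:  # NaN is the only value != itself
                raise ValidationError(f"{column}: null value")
        for column in LIST_FIELDS:
            value = record[column]
            if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
                raise ValidationError(f"{column}: expected list[str]")
    return {"rows": len(records), "passed": True}


detected = detect_source(["EID", "Authors", "Source title", "Year"])
clean = enforce_types({"AU": "Rossi M.; Bianchi L.", "PY": "2021 Mar 15",
                       "TC": "7", "TI": "A study", "DB": "SCOPUS"})
report = validate([clean])
```

Raising on the first violation keeps the error message in the single "<column_name>: <reason>" form; a report that collects every violation would be an equally reasonable design.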
5. SR Calculated Field
The Short Reference (SR) field is computed by calling the existing SR(M) function
from www/services/metatagextraction.py; it was not rewritten. A faithful fallback
is used only when that module cannot be imported (e.g. Shiny-specific dependencies absent
in the Streamlit environment).
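The guarded import described above might look roughly like this. The flag `HAVE_ORIGINAL_SR` and the function `fallback_sr` are hypothetical names, and the "FIRST AUTHOR, YEAR, SOURCE" format used by the fallback is an assumption for illustration: the exact output of SR(M) is not shown in this description.

```python
try:
    # Existing function from the repository; only importable inside that checkout
    # and only when its Shiny-specific dependencies are installed.
    from www.services.metatagextraction import SR  # noqa: F401
    HAVE_ORIGINAL_SR = True
except ImportError:
    HAVE_ORIGINAL_SR = False


def fallback_sr(record: dict) -> str:
    """Assumed short-reference format: "FIRST AUTHOR, YEAR, SOURCE" in upper case."""
    authors = record.get("AU") or []
    first_author = authors[0] if authors else ""
    parts = [first_author, record.get("PY") or "", record.get("SO") or ""]
    # Skip empty parts so a missing year or source does not leave a dangling comma.
    return ", ".join(part.strip().upper() for part in parts if part.strip())


example = fallback_sr({"AU": ["Rossi M.", "Bianchi L."], "PY": "2021", "SO": "Journal X"})
```

Catching only `ImportError` means that a genuine bug inside the original module still surfaces instead of silently switching to the fallback.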
Patches Applied to Existing Functions
Three files contained hardcoded Web of Science or Scopus-only logic that caused
crashes on data from other sources. Each was patched in place with a
# PATCHED: comment. Nothing was deleted or rewritten.
www/services/histnetwork.py
www/services/biblionetwork.py
Bug fix: db_name == "SCOPUS" never matched because the Shiny app passes "Scopus" (mixed case). Fixed with .upper().
label_short() extension: added all new sources to citation label formatting.
www/services/metatagextraction.py
Streamlit Dashboard (dashboard/app.py)
A standalone five-tab dashboard, separate from the existing Shiny app, which is untouched.
Design constraints enforced: no emojis anywhere, DM Sans font via Google Fonts, deep purple
#2e1760 sidebar, #7c3aed accent, white card panels, all charts via plotly.graph_objects.
The dashboard stubs out www.services.utils at import time so parsers.py can be loaded
without the Shiny-specific dependencies (prince, igraph, faicons) being installed.
API Evidence
PubMed query: "lactic acid bacteria fermentation" (10 results, truncated)
OpenAlex query: "riboflavin biofortification" (10 results, truncated)
Both outputs pass validate() and are exported to data/outputs/ with ;-delimited
multi-value fields.
Testing
tests/test_etl.py contains 62 tests in three groups:
Unit tests (25), no network, no file I/O:
- detect_source() identifies PubMed and OpenAlex correctly
- rename_columns() maps source tags to WoS Field Tags for every source
- enforce_types() produces list[str] for list fields, 4-digit PY, int TC, clean DOI
- handle_nulls() eliminates all NaN values
- validate() passes on good data, raises ValidationError on missing column / NaN / wrong type
File-source tests (37), requires the sources/ directory:
- load_file() smoke tests for all 8 file formats
- test_pipeline_file_source[<source>]: full ETL pipeline for each
- test_validate_file_output[<source>]: validate() must pass for each
- export_to_csv()
Integration tests (2), live API calls:
- fetch_pubmed() and fetch_openalex() return non-empty DataFrames
Run with:
Existing Shiny Dashboard Compatibility
No changes were made to www/app.py or any file in www/functions/. The standardized
CSV output from all 10 sources was verified to be compatible with the Shiny app's data
loading path. The three patched files in www/services/ are backwards-compatible: the
original WoS and Scopus branches are preserved exactly, and new branches were only added.
How to Run the Streamlit Dashboard
Navigate to http://localhost:8501.