feat: OpenAlex/PubMed ETL pipeline with WoS schema standardization#24
Open
isx9 wants to merge 112 commits into
Open
feat: OpenAlex/PubMed ETL pipeline with WoS schema standardization#24isx9 wants to merge 112 commits into
isx9 wants to merge 112 commits into
Conversation
Brief description inside.
Description inside.
Description inside.
Finished www/services/
Completed for both OpenAlex and PubMed.
Added SR calculation to standardize() main function
Add validator.py with column, null and type checks
Copy SR function directly into standardizer to avoid relative import error
Solved bugs: - `LA` (PubMed) returned as a list `['eng']` by the API, needed to extract first element as string - `DI` (OpenAlex) returned as full URL `"https://doi.org/..."`, needed to strip the prefix. Also `None` for some records, needed `or ""` guard - `PY` (OpenAlex) returned as integer `2012`, needed to cast to string `"2012"`
…o dev # Please enter a commit message to explain why this merge is necessary, # especially if it merges an updated upstream into a topic branch. # # Lines starting with '#' will be ignored, and an empty message aborts # the commit.
This reverts commit 8205ed2.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Group Members
Isabella Di Lorenzi — D03000224
Maria Pasconcino — D03000295
Summary
This PR adds an Extract → Transform → Validate pipeline (
bibliometrix-python-dev) that retrieves records from the OpenAlex and PubMed REST APIs and standardizes them into the Web of Science (WoS) schema expected by thebibliometrix-pythondashboard. It also documents and patches the dashboard-side bugs uncovered while validating that standardized output against every analytical function inservices/andfunctions/.1. Architecture
1.1 Pipeline overview
Each phase lives in its own module, with no monolithic function.
execution_evidence.ipynbwires them together asretrieve() → standardize() → validate(), the functional equivalent of the R package'sconvert2df(), which was the missing single entry point in the original codebase.1.2 The Dispatcher
Dispatch happens at exactly two points — the boundary of each phase — rather than being re-implemented by every downstream consumer (the failure mode in the legacy
format_functions.py, where every column-level formatter re-branches on source):fetch_openalex()/fetch_pubmed()each absorb that API's own pagination, ID-resolution, and rate-limiting, but both return the same shape (list[dict]).standardize_openalex()/standardize_pubmed()each own all null-handling, list-casting, and default-filling for their source internally — source-specific logic is contained to these two switch points instead of leaking into every caller.1.3 Mapping dictionaries
mappings.pyholds flatraw_field → WoS_tagdictionaries, applied viaapply_mapping():The dictionary covers only direct 1:1 renames. Each one is followed by inline comments enumerating the two exception classes that can't go in a lookup table: fields needing structural transformation (OpenAlex's nested
authorships,abstract_inverted_index,biblio; PubMed'sarticleids,pages), handled by dedicatedparse_*functions; and fields with no equivalent on that source (IDfor both,PMIDfor OpenAlex,TC/C1/DE/AU_COfor PubMed's eSummary response), explicitly filled with safe defaults rather than silently absent. Oneparse_*function goes further than reshaping: OpenAlex'sreferenced_worksis just a list of bare work IDs, soparse_openalex_references()hands those IDs toresolve_openalex_references(), which calls back out to the OpenAlex API and reconstructs proper"Smith J, 2019, NAT COMMUN"-style strings forCR— the one field that round-trips through a second API call rather than transforming data already in hand.1.4 Type contracts
validator.pydefines every WoS tag's required Python type and runs three checks against it:The null-handling convention (
""for scalars,[]for lists, neverNone/NaN) is enforced twice: insidestandardize_openalex()/standardize_pubmed()via an explicit sanitization pass, and again externally by the validator as a hard gate. A regression in either layer is still caught by the other.2. Files Changed
New files (
www/services/)api_retriever.pymappings.pyPUBMED_MAPPING/OPENALEX_MAPPINGdeclarative rename dictionariesstandardizer.pySRcomputationvalidator.py3. Debugging & Patches to Analytical Functions
3.1 Methodology
Once the ETL produced valid, schema-compliant
DataFrames, each was run through every relevant dashboard panel against a 200-row test set per source (test_openalex_200.csv,test_pubmed_200.csv). Every failure was classified before any fix was attempted:Only items in the first bucket were patched. Items in the second are documented in 3.3 so graders and future maintainers don't mistake "no API support" for an ETL defect. Full technical detail (exact line, root cause, fix) for every file below lives in
patching_documented.md.3.2 Files patched — confirmed fixes
By far the most common single fix, applied across nearly every file below: replacing
hasattr(df, "get")(which is also true for a plainpandas.DataFrame, since it has its own.get()method) withisinstance(df, pd.DataFrame)to correctly tell a Shiny reactive object apart from a plain DataFrame. The rest of each file's patches handle missing/empty columns, non-numericPY/TC, and degenerate network/matrix cases gracefully instead of crashing.www/services/biblionetwork.py,cocmatrix.py,couplingmap.py,format_functions.py,histnetwork.py,metatagextraction.py,networkplot.py,termextraction.py,thematicmap.pywww/functions/get_affiliationproductionovertime.py,get_annualproduction.py,get_authorlocalimpact.py,get_authorproductionovertime.py,get_averagecitations.py,get_bradfordlaw.py,get_citedcountries.py,get_citeddocuments.py,get_clusteringcoupling.py,get_co_occurence_network.py,get_cocitation.py,get_collaborationnetwork.py,get_correspondingauthorcountries.py,get_countriesproduction.py,get_countriesproductionovertime.py,get_factorialanalysis.py,get_filters.py,get_frequentwords.py,get_historiograph.py,get_localcitedauthors.py,get_localciteddocuments.py,get_localcitedreferences.py,get_localcitedsources.py,get_lotkalaw.py,get_maininformations.py,get_referencesspectroscopy.py,get_relevantaffiliations.py,get_relevantauthors.py,get_relevantsources.py,get_sourceslocalimpact.py,get_sourcesproduction.py,get_table.py,get_thematicevolution.py,get_thematicmap.py,get_threefieldplot.py,get_treemap.py,get_trendtopics.py,get_wordcloud.py,get_wordfrequency.py,get_worldmapcollaboration.pytabletag.pyrequired no patch (passes as-is on both sources).app.pyTI) field via aSilentException— removed that exclusion.3.3 Data-characteristic limitations (not bugs)
ID(Keywords Plus)WC(Subject Categories)table_tag()checksif tag not in M.columnsbefore accessing it, so every consuming function degrades the same wayIDdoes — a clean empty result, not a crash — verified by direct testing (WordCloud, Most Frequent Words, TreeMap) on both sourcesAU_UN(Author Institutions)UNIV,INST,COLL); OpenAlex'sraw_affiliation_stringsdon't follow that comma-segmented structure, so it largely fails to isolate clean institution names — the same underlying affiliation-format mismatch as theAU_CO/country case. For PubMed the empty result is expected for a different reason (C1is itself largely empty), but this hasn't been directly verifiedCR(Cited References)AB(Abstract)TC(Times Cited)0for every record, filled as a safe defaultPYdistribution4. Validation Summary
The patches in 3.2 bring the entire set of assigned
services/andfunctions/modules to crash-free, graceful execution on the 200-row test sets for both OpenAlex and PubMed (full per-function status inpatching_documented.md, modulo the four entries flagged in 3.3 as superseded). The remaining gaps are the data-characteristic limitations in 3.3 — empty or reduced output where the source itself doesn't provide the underlying information, not pipeline defects.5. Live Dashboard Integration (Bonus)
Beyond the standalone CSV deliverable, the API query feature is wired directly into the Shiny dashboard, satisfying the spec's optional bonus ("Integrate this API query feature directly into the Python Shiny interface").
Two changes were required:
www/services/__init__.pynow exports the three ETL entry points:Without this,
app.pycannot see these functions viafrom www.services import *and fails withname 'retrieve' is not defined.app.pyreplaces the previous placeholder panel ("🚧 API is under construction 🚧") with a working "API Query" panel: a text input for the search query, a platform selector (OpenAlex / PubMed), a record-count input, and a "Run Query" button. On click:The panel also renders the first rows of the normalized result (
AU,TI,PY,SO,TC,DB,SR) directly in the UI, satisfying the "print the first few normalized rows" requirement without a separate script.This path never touches a CSV file — data goes straight from
validate()into the shared reactivedf. The CSV-serialization conventions discussed elsewhere (e.g. the;delimiter for multi-value fields) apply only to the standalone CSV deliverable and to the dashboard's separate "Import Raw Data" upload panel, not to this live-query path.6. Execution Evidence
execution_evidence.ipynbis the step-by-step execution log required by the spec, run end to end against both APIs:df.head()previews printed at each step, to inspect the raw API response shape before standardization.test_openalex_200.csv/test_pubmed_200.csvwith the;delimiter convention described in Section 1.3.NaN) survives a full save/reload cycle — relevant for the dashboard's separate "Import Raw Data" upload panel, as opposed to the live in-memory path described in Section 5.Sample output from the notebook:
The same flow with
platform="pubmed", source="pubmed"producestest_pubmed_200.csv. Both CSVs can be uploaded directly to the dashboard after converting them in a XLSX format.