Python Full-Stack Data Engineer — NLP Scoring Web App · FastAPI · Streamlit · Parquet · scikit-learn

Posted 5 days ago

Worldwide

Summary

Project Overview: We are building an intelligent web platform that helps researchers, PhD students, and academics find the best open-access journals to publish their scientific articles. The system scores and ranks 22,890+ journals from the DOAJ directory using a multi-block scoring model, NLP semantic matching, and external data sources (SCImago SJR, OpenAlex API, Beall's List).

The platform has two modes:
- Exploration mode: interactive dashboard with world map, scatter plots, heatmaps and cluster filters; no article needed.
- Consultation mode: the user inputs a title, topic area and/or abstract and receives a personalized ranking of the top 10 most compatible journals with scores, APC cost in USD, estimated publication weeks, SCImago quartile, and confidence level of the match.

What we need you to build (8-week MVP):

Sprint 1 - Data ingestion (DELIVERED): The base script ingest.py is already built and tested. It loads the DOAJ CSV (22,890 rows, 52 columns), cleans all columns, normalizes scores to a 0-1 scale, parses APC amounts from multiple currencies to USD, and outputs journals_clean.parquet. You start from Sprint 2.

Sprint 2 - External enrichment: Cross-reference journals by ISSN with SCImago SJR CSV (quartile, H-index, SJR score), OpenAlex API (articles/year, Scopus/WoS indexing), and Beall's List (predatory journal flag). Output: journals_enriched.parquet.

Sprint 3 - Scoring engine + clusters: Calculate weighted score: B3 (editorial process) 25% + B4 (quality signals) 20% + B5 (cost control) 15%. Apply hard exclusion filter for predatory journals. Run 5 KMeans clusters (APC, rigor, topic, geography, speed). Output: journals_scored.parquet, the master file that feeds everything else.

Sprint 4 - NLP matching engine: Build adaptive input detector (title only / title + area / title + area + abstract).
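As a rough illustration of the adaptive input detector, here is a minimal sketch that maps the submitted fields to one of the three input levels. The function name `detect_input_level` is hypothetical and not taken from the specification. One assumption is made loudly: an abstract supplied without a topic area is not one of the three listed combinations, so this sketch falls back to level 1 for it; the specification may define that case differently.

```python
from typing import Optional


def detect_input_level(
    title: str,
    area: Optional[str] = None,
    abstract: Optional[str] = None,
) -> int:
    """Return 1 (title only), 2 (title + area) or 3 (title + area + abstract).

    Hypothetical helper: the name and the fallback rule are assumptions.
    """
    if not title or not title.strip():
        # The title is the only required field in the consultation form.
        raise ValueError("title is required")

    has_area = bool(area and area.strip())
    has_abstract = bool(abstract and abstract.strip())

    if has_area and has_abstract:
        return 3
    if has_area:
        return 2
    # Assumption: an abstract without an area is treated as title only.
    return 1


level_title_only = detect_input_level("Soil microbiomes in arid regions")
level_with_area = detect_input_level("Soil microbiomes in arid regions", area="Agriculture")
level_full = detect_input_level(
    "Soil microbiomes in arid regions",
    area="Agriculture",
    abstract="We study bacterial diversity in desert soils.",
)
```

Whitespace-only fields count as missing here, so a form that submits empty strings behaves the same as one that omits the field.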
Implement 3-level NLP strategy: keyword extraction with KeyBERT or spaCy for level 1, LCC code mapping for level 2, sentence-transformers embeddings + cosine similarity for level 3. Pre-calculate and cache all journal scope embeddings. Model: all-MiniLM-L6-v2.

Sprint 5 - Exploration mode (Streamlit): Build interactive dashboard with sidebar filters (topic area, country, quartile, APC tier, weeks, language, toggles), world bubble map with Plotly, APC vs Score scatter plot, area x country heatmap, filterable table with CSV export.

Sprint 6 - Consultation mode (Streamlit): Adaptive form with title required and area + abstract optional. Priority sliders for speed / prestige / cost with weights that sum to 100%. Top-10 results with score breakdown bars, APC in USD, weeks, quartile, CC license, confidence badge (Low / Medium / High), direct DOAJ link, PDF and CSV export.

Sprint 7 - REST API + auth + roles: FastAPI backend with JWT authentication. Endpoints: POST /match, GET /journal by ISSN, GET /explore, POST /auth/login, GET /history. SQLite database with tables for users, searches and usage tracking. Two roles: client (20 queries/month limit) and internal (unlimited). Connect Streamlit to the API.

Sprint 8 - Deploy + automation + docs: Deploy FastAPI and Streamlit to Render.com or Railway.app (free tier). Monthly pipeline automation script that auto-downloads the new DOAJ CSV and regenerates the scored parquet. Full README documentation.
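To make the level 3 matching step from Sprint 4 more concrete, the following is a minimal sketch of ranking journals by cosine similarity between a query embedding and pre-computed journal scope embeddings. It uses only the standard library, and toy 3-dimensional vectors stand in for the 384-dimensional vectors that all-MiniLM-L6-v2 produces. The function names, the dictionary layout and the ISSN-like keys are all hypothetical placeholders, not part of the specification.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_journals(
    query_vec: Sequence[float],
    journal_vecs: Dict[str, Sequence[float]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Return the top_k (journal key, similarity) pairs, best match first."""
    scored = [
        (key, cosine_similarity(query_vec, vec))
        for key, vec in journal_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Placeholder keys and toy vectors; real embeddings would be cached on disk.
journal_embeddings = {
    "0000-0001": [1.0, 0.0, 0.0],
    "0000-0002": [0.0, 1.0, 0.0],
    "0000-0003": [1.0, 1.0, 0.0],
}
query_embedding = [1.0, 0.2, 0.0]

top_matches = rank_journals(query_embedding, journal_embeddings, top_k=2)
```

In a real pipeline the journal vectors would be computed once and cached, so each query would need only one model call for the query text followed by this similarity pass.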
Tech stack required (mandatory): Python 3.10+, pandas, numpy, pyarrow, scikit-learn, sentence-transformers, FastAPI, Streamlit, Plotly, requests, BeautifulSoup4, SQLite, SQLAlchemy, JWT auth, Git

Desirable (not blocking): umap-learn, KeyBERT or spaCy, Render or Railway deploy experience, basic NLP background

What we provide:
- Full technical specification document (Word, 11 sections) with architecture diagrams, complete data schema, scoring formulas, NLP strategy per input level and acceptance criteria per sprint
- Sprint 1 code (ingest.py) already built, tested and documented, ready to run
- DOAJ CSV file (22,890 journals, May 2026)
- Clear acceptance criteria per sprint for milestone payments

Engagement model: Fixed-price contract with milestone payments per sprint delivery. Each sprint has defined acceptance criteria that must pass before payment is released.

Ideal candidate:
- 3 to 5 years Python experience with data engineering projects
- Has built at least one FastAPI or Flask API with authentication
- Has worked with Streamlit or similar data web frameworks
- Comfortable with pandas, parquet files and scikit-learn
- Basic NLP experience with embeddings and similarity (no deep research required)
- Strong English written communication for async collaboration
- Available for a short video call before contract start

To apply, please answer:
- Have you worked with sentence-transformers or similar embedding libraries? Share a brief example.
- Have you deployed a FastAPI and Streamlit application before? Where?
- What is your estimated timeline for Sprint 2 (external enrichment)?
- Your fixed-price proposal for the full 8-sprint MVP.

Hourly: Less than 30 hrs/week
Duration: 1-3 months
Experience Level: Intermediate
Hourly rate: $8.00 - $15.00
Remote Job
Project Type: One-time project

Contract-to-hire opportunity


Skills and Expertise

Mandatory skills

Data Engineering

Data Preprocessing

Activity on this job

Proposals: 5 to 10
Last viewed by client: 5 days ago
Interviewing: 0
Invites sent: 0
Unanswered invites: 0

About the client

Member since Jan 18, 2026

Peru
4:47 AM
