Python Full-Stack Data Engineer — NLP Scoring Web App · FastAPI · Streamlit · Parquet · scikit-learn
Worldwide
Project Overview: We are building an intelligent web platform that helps researchers, PhD students, and academics find the best open-access journals to publish their scientific articles. The system scores and ranks 22,890+ journals from the DOAJ directory using a multi-block scoring model, NLP semantic matching, and external data sources (SCImago SJR, OpenAlex API, Beall's List).

The platform has two modes:
- Exploration mode: interactive dashboard with world map, scatter plots, heatmaps and cluster filters. No article needed.
- Consultation mode: the user inputs a title, topic area and/or abstract and receives a personalized ranking of the top 10 most compatible journals with scores, APC cost in USD, estimated publication weeks, SCImago quartile, and confidence level of the match.

What we need you to build (8-week MVP):

Sprint 1 - Data ingestion (DELIVERED): The base script ingest.py is already built and tested. It loads the DOAJ CSV (22,890 rows, 52 columns), cleans all columns, normalizes scores to 0-1 scale, parses APC amounts from multiple currencies to USD, and outputs journals_clean.parquet. You start from Sprint 2.

Sprint 2 - External enrichment: Cross-reference journals by ISSN with SCImago SJR CSV (quartile, H-index, SJR score), OpenAlex API (articles/year, Scopus/WoS indexing), and Beall's List (predatory journal flag). Output: journals_enriched.parquet.

Sprint 3 - Scoring engine + clusters: Calculate weighted score: B3 (editorial process) 25% + B4 (quality signals) 20% + B5 (cost control) 15%. Apply hard exclusion filter for predatory journals. Run 5 KMeans clusters (APC, rigor, topic, geography, speed). Output: journals_scored.parquet, the master file that feeds everything else.

Sprint 4 - NLP matching engine: Build adaptive input detector (title only / title + area / title + area + abstract).
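As a rough illustration of the adaptive input detector named above, a minimal sketch might look like this. The names `InputLevel` and `detect_input_level` are hypothetical and not taken from the specification. The sketch also assumes that an abstract supplied without a topic area still goes to the embedding level, a case the listing does not define.

```python
from enum import Enum
from typing import Optional


class InputLevel(Enum):
    """Hypothetical labels for the three NLP strategy levels."""
    KEYWORDS = 1      # title only
    LCC_MAPPING = 2   # title + area
    EMBEDDINGS = 3    # title + area + abstract


def _has_text(value: Optional[str]) -> bool:
    """True when the field holds something other than whitespace."""
    return bool(value and value.strip())


def detect_input_level(title: str,
                       area: Optional[str] = None,
                       abstract: Optional[str] = None) -> InputLevel:
    """Pick the matching level from whichever fields the user filled in."""
    if not _has_text(title):
        # The title is the only required field
        raise ValueError("title is required")
    if _has_text(abstract):
        # Assumption: any usable abstract selects the embedding level,
        # even when the topic area was left empty
        return InputLevel.EMBEDDINGS
    if _has_text(area):
        return InputLevel.LCC_MAPPING
    return InputLevel.KEYWORDS
```

Treating whitespace-only fields as empty keeps a blank form field from accidentally selecting a richer strategy than the input can support.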
Implement 3-level NLP strategy: keyword extraction with KeyBERT or spaCy for level 1, LCC code mapping for level 2, sentence-transformers embeddings + cosine similarity for level 3. Pre-calculate and cache all journal scope embeddings. Model: all-MiniLM-L6-v2.

Sprint 5 - Exploration mode (Streamlit): Build interactive dashboard with sidebar filters (topic area, country, quartile, APC tier, weeks, language, toggles), world bubble map with Plotly, APC vs Score scatter plot, area x country heatmap, filterable table with CSV export.

Sprint 6 - Consultation mode (Streamlit): Adaptive form with title required and area + abstract optional. Priority sliders for speed / prestige / cost with weights that sum to 100%. Top-10 results with score breakdown bars, APC in USD, weeks, quartile, CC license, confidence badge (Low / Medium / High), direct DOAJ link, PDF and CSV export.

Sprint 7 - REST API + auth + roles: FastAPI backend with JWT authentication. Endpoints: POST /match, GET /journal by ISSN, GET /explore, POST /auth/login, GET /history. SQLite database with tables for users, searches and usage tracking. Two roles: client (20 queries/month limit) and internal (unlimited). Connect Streamlit to the API.

Sprint 8 - Deploy + automation + docs: Deploy FastAPI and Streamlit to Render.com or Railway.app (free tier). Monthly pipeline automation script that auto-downloads the new DOAJ CSV and regenerates the scored parquet. Full README documentation.
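To make the Sprint 3 scoring step described above more concrete, here is a minimal sketch of the weighted score with the hard exclusion filter. It uses only the three block weights listed in this posting, which total 0.60, so the result ranges from 0 to 0.60. The field names (`b3`, `b4`, `b5`, `is_predatory`), the function names and the sample ISSNs are made up for illustration and are not the schema from the specification.

```python
from typing import Optional

# Block weights exactly as listed for Sprint 3 (they total 0.60)
BLOCK_WEIGHTS = {"b3": 0.25, "b4": 0.20, "b5": 0.15}


def weighted_score(journal: dict) -> Optional[float]:
    """Return the weighted block score, or None for an excluded journal."""
    # Hard exclusion: a predatory flag removes the journal outright
    if journal.get("is_predatory", False):
        return None
    return sum(weight * journal[block] for block, weight in BLOCK_WEIGHTS.items())


def rank_journals(journals: list) -> list:
    """Return (issn, score) pairs, best first, skipping excluded journals."""
    scored = [(j["issn"], weighted_score(j)) for j in journals]
    kept = [(issn, score) for issn, score in scored if score is not None]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up journals with block scores already normalized to the 0-1 scale
sample = [
    {"issn": "0000-0001", "b3": 0.8, "b4": 0.5, "b5": 1.0},
    {"issn": "0000-0002", "b3": 0.9, "b4": 0.9, "b5": 0.2},
    {"issn": "0000-0003", "b3": 1.0, "b4": 1.0, "b5": 1.0, "is_predatory": True},
]
ranking = rank_journals(sample)
```

In this sketch the flagged journal is dropped before ranking rather than given a score of zero, so it cannot reappear in a result list no matter how the remaining journals score.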
Tech stack required (mandatory): Python 3.10+, pandas, numpy, pyarrow, scikit-learn, sentence-transformers, FastAPI, Streamlit, Plotly, requests, BeautifulSoup4, SQLite, SQLAlchemy, JWT auth, Git

Desirable (not blocking): umap-learn, KeyBERT or spaCy, Render or Railway deploy experience, basic NLP background

What we provide:
- Full technical specification document (Word, 11 sections) with architecture diagrams, complete data schema, scoring formulas, NLP strategy per input level and acceptance criteria per sprint
- Sprint 1 code (ingest.py) already built, tested and documented, ready to run
- DOAJ CSV file (22,890 journals, May 2026)
- Clear acceptance criteria per sprint for milestone payments

Engagement model: Fixed-price contract with milestone payments per sprint delivery. Each sprint has defined acceptance criteria that must pass before payment is released.

Ideal candidate:
- 3 to 5 years Python experience with data engineering projects
- Has built at least one FastAPI or Flask API with authentication
- Has worked with Streamlit or similar data web frameworks
- Comfortable with pandas, parquet files and scikit-learn
- Basic NLP experience with embeddings and similarity (deep research experience not required)
- Strong English written communication for async collaboration
- Available for a short video call before contract start

To apply, please answer:
- Have you worked with sentence-transformers or similar embedding libraries? Share a brief example.
- Have you deployed a FastAPI and Streamlit application before? Where?
- What is your estimated timeline for Sprint 2 (external enrichment)?
- Your fixed-price proposal for the full 8-sprint MVP.
- Hourly: Less than 30 hrs/week
- Duration: 1-3 months
- Experience Level: Intermediate
- Hourly rate: $8.00 - $15.00
- Remote Job
- Project Type: One-time project
About the client
- Peru