Python Full-Stack Data Engineer — NLP Scoring Web App · FastAPI · Streamlit · Parquet · scikit-learn

Posted 5 days ago

Worldwide

Summary

Project Overview: We are building an intelligent web platform that helps researchers, PhD students, and academics find the best open-access journals to publish their scientific articles. The system scores and ranks 22,890+ journals from the DOAJ directory using a multi-block scoring model, NLP semantic matching, and external data sources (SCImago SJR, OpenAlex API, Beall's List).

The platform has two modes:
  • Exploration mode: interactive dashboard with world map, scatter plots, heatmaps and cluster filters (no article needed).
  • Consultation mode: the user inputs a title, topic area and/or abstract and receives a personalized ranking of the top 10 most compatible journals with scores, APC cost in USD, estimated publication weeks, SCImago quartile, and confidence level of the match.

What we need you to build (8-week MVP):

Sprint 1 - Data ingestion (DELIVERED): The base script ingest.py is already built and tested. It loads the DOAJ CSV (22,890 rows, 52 columns), cleans all columns, normalizes scores to a 0-1 scale, parses APC amounts from multiple currencies to USD, and outputs journals_clean.parquet. You start from Sprint 2.

Sprint 2 - External enrichment: Cross-reference journals by ISSN with SCImago SJR CSV (quartile, H-index, SJR score), OpenAlex API (articles/year, Scopus/WoS indexing), and Beall's List (predatory journal flag). Output: journals_enriched.parquet.

Sprint 3 - Scoring engine + clusters: Calculate weighted score: B3 (editorial process) 25% + B4 (quality signals) 20% + B5 (cost control) 15%. Apply hard exclusion filter for predatory journals. Run 5 KMeans clusters (APC, rigor, topic, geography, speed). Output: journals_scored.parquet, the master file that feeds everything else.

Sprint 4 - NLP matching engine: Build adaptive input detector (title only / title + area / title + area + abstract).
Implement 3-level NLP strategy: keyword extraction with KeyBERT or spaCy for level 1, LCC code mapping for level 2, sentence-transformers embeddings + cosine similarity for level 3. Pre-calculate and cache all journal scope embeddings. Model: all-MiniLM-L6-v2.

Sprint 5 - Exploration mode (Streamlit): Build interactive dashboard with sidebar filters (topic area, country, quartile, APC tier, weeks, language, toggles), world bubble map with Plotly, APC vs Score scatter plot, area x country heatmap, filterable table with CSV export.

Sprint 6 - Consultation mode (Streamlit): Adaptive form with title required and area + abstract optional. Priority sliders for speed / prestige / cost with weights that sum to 100%. Top-10 results with score breakdown bars, APC in USD, weeks, quartile, CC license, confidence badge (Low / Medium / High), direct DOAJ link, PDF and CSV export.

Sprint 7 - REST API + auth + roles: FastAPI backend with JWT authentication. Endpoints: POST /match, GET /journal by ISSN, GET /explore, POST /auth/login, GET /history. SQLite database with tables for users, searches and usage tracking. Two roles: client (20 queries/month limit) and internal (unlimited). Connect Streamlit to the API.

Sprint 8 - Deploy + automation + docs: Deploy FastAPI and Streamlit to Render.com or Railway.app (free tier). Monthly pipeline automation script that auto-downloads the new DOAJ CSV and regenerates the scored parquet. Full README documentation.
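
As a rough illustration of the Sprint 3 scoring step, the sketch below applies the three block weights exactly as listed (25%, 20%, 15%) and drops journals flagged as predatory before ranking. The field names (b3_editorial, b4_quality, b5_cost, is_predatory) and the sample ISSNs are hypothetical placeholders, not taken from the specification. The listed weights sum to 60%, so the sketch uses them as given and makes no assumption about any other blocks.

```python
# Weights as listed for Sprint 3. They sum to 0.60; this sketch applies them
# as given and does not guess at any additional scoring blocks.
WEIGHTS = {"b3_editorial": 0.25, "b4_quality": 0.20, "b5_cost": 0.15}


def weighted_score(journal: dict) -> float:
    """Weighted sum of block scores, each assumed to be on a 0-1 scale."""
    return sum(weight * journal[block] for block, weight in WEIGHTS.items())


def score_journals(journals: list[dict]) -> list[dict]:
    """Hard-exclude flagged journals, score the rest, and sort best-first."""
    kept = [j for j in journals if not j["is_predatory"]]
    scored = [{**j, "score": round(weighted_score(j), 4)} for j in kept]
    return sorted(scored, key=lambda j: j["score"], reverse=True)


# Made-up rows; field names and ISSNs are illustrative assumptions only.
sample = [
    {"issn": "0000-0001", "b3_editorial": 0.8, "b4_quality": 0.6,
     "b5_cost": 1.0, "is_predatory": False},
    {"issn": "0000-0002", "b3_editorial": 0.9, "b4_quality": 0.9,
     "b5_cost": 0.9, "is_predatory": True},
    {"issn": "0000-0003", "b3_editorial": 0.4, "b4_quality": 0.5,
     "b5_cost": 0.2, "is_predatory": False},
]

ranked = score_journals(sample)
for row in ranked:
    print(row["issn"], row["score"])
```

In the actual pipeline this logic would presumably run over journals_enriched.parquet with pandas rather than over plain dictionaries; the plain-Python form is used here only to keep the sketch dependency-free.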
Tech stack required (mandatory): Python 3.10+, pandas, numpy, pyarrow, scikit-learn, sentence-transformers, FastAPI, Streamlit, Plotly, requests, BeautifulSoup4, SQLite, SQLAlchemy, JWT auth, Git

Desirable (not blocking): umap-learn, KeyBERT or spaCy, Render or Railway deploy experience, basic NLP background

What we provide:
  • Full technical specification document (Word, 11 sections) with architecture diagrams, complete data schema, scoring formulas, NLP strategy per input level, and acceptance criteria per sprint
  • Sprint 1 code (ingest.py), already built, tested, documented, and ready to run
  • DOAJ CSV file (22,890 journals, May 2026)
  • Clear acceptance criteria per sprint for milestone payments

Engagement model: Fixed-price contract with milestone payments per sprint delivery. Each sprint has defined acceptance criteria that must pass before payment is released.

Ideal candidate:
  • 3 to 5 years Python experience with data engineering projects
  • Has built at least one FastAPI or Flask API with authentication
  • Has worked with Streamlit or similar data web frameworks
  • Comfortable with pandas, parquet files and scikit-learn
  • Basic NLP experience with embeddings and similarity (deep research experience is not required)
  • Strong English written communication for async collaboration
  • Available for a short video call before contract start

To apply, please answer:
  • Have you worked with sentence-transformers or similar embedding libraries? Share a brief example.
  • Have you deployed a FastAPI and Streamlit application before? Where?
  • What is your estimated timeline for Sprint 2 (external enrichment)?
  • Your fixed-price proposal for the full 8-sprint MVP.
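
For orientation on the level-3 matching step described under Sprint 4 (embeddings plus cosine similarity), the ranking itself can be sketched as follows. This is a minimal, dependency-free illustration: the toy 3-dimensional vectors stand in for cached journal scope embeddings, which in the real system would come from the all-MiniLM-L6-v2 model via sentence-transformers. The function names and journal keys are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vec, journal_vecs, k=10):
    """Return the k (journal_id, similarity) pairs most similar to the query."""
    scored = [
        (journal_id, cosine_similarity(query_vec, vec))
        for journal_id, vec in journal_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for pre-computed, cached journal scope embeddings.
cached = {
    "journal-a": [1.0, 0.0, 0.0],
    "journal-b": [0.6, 0.8, 0.0],
    "journal-c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]  # stand-in for the embedded title/abstract

matches = top_matches(query, cached, k=2)
for journal_id, similarity in matches:
    print(journal_id, round(similarity, 3))
```

Caching matters here because the journal vectors do not change between queries: only the user's input needs to be embedded at request time, after which ranking reduces to the comparison shown above.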

  • Less than 30 hrs/week
    Hourly
  • 1-3 months
    Duration
  • Intermediate
    Experience Level
  • $8.00 - $15.00
    Hourly
  • Remote Job
  • One-time project
    Project Type

Contract-to-hire opportunity

This lets talent know that this job could become full time.
Skills and Expertise
Mandatory skills
Data Engineering
Data Preprocessing
Activity on this job
  • Proposals: 5 to 10
  • Last viewed by client: 5 days ago
  • Interviewing: 0
  • Invites sent: 0
  • Unanswered invites: 0
About the client
Member since Jan 18, 2026
  • Peru
    4:47 AM
