We are a market research company looking for an experienced Spark / Scala developer to work on a large-scale data pipeline. We have smaller tasks to start with, and if those go well, we will consider you for future projects and/or ongoing work.
Current Project: We currently have several S3 buckets with AWS server access logging enabled. We want to be able to easily monitor and query S3 download activity and data usage.
Currently, AWS writes these logs as many small JSON files in a flat folder structure, which makes querying them in tools like Athena or Redshift slow.
We would like to drop events not related to download activity, partition by date, and convert the output to Parquet.
The task is to write a Spark Scala job that reads the JSON log files stored in S3 and converts, sorts, and writes them out as partitioned Parquet. The job will drop unneeded columns and repartition the data to produce reasonably sized files for use in Athena.
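To give a sense of the shape of the job, here is a minimal sketch. The field names, the download-operation filter value, and the S3 paths are assumptions for illustration; the real schema and paths would come from the actual log files.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object S3LogsToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-access-logs-to-parquet")
      .getOrCreate()

    // Read the raw JSON logs (input path is a placeholder).
    val logs = spark.read.json("s3://example-log-bucket/raw/")

    val downloads = logs
      // Keep only download events; "REST.GET.OBJECT" is the S3 access-log
      // operation name for object GETs, but the exact field and value should
      // be confirmed against the real data.
      .filter(col("operation") === "REST.GET.OBJECT")
      // Derive a date column from the event timestamp for partitioning
      // (assumes an ISO-formatted timestamp field).
      .withColumn("date", to_date(col("eventTime")))
      // Drop everything except the columns we query on (assumed names).
      .select("date", "bucket", "key", "bytesSent", "remoteIp")

    downloads
      // Repartition by date so each partition writes a small number of
      // reasonably sized files instead of many tiny ones.
      .repartition(col("date"))
      .write
      .partitionBy("date")
      .mode("overwrite")
      .parquet("s3://example-log-bucket/parquet/") // placeholder output path

    spark.stop()
  }
}
```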
Candidates must have:
- Extensive experience with the Apache Spark framework
- Proficiency in the Scala and Python programming languages
- Prior work with the Parquet data format
- Strong Bash and *nix skills, plus version control with Git
We prefer candidates with experience in:
- Apache Spark Streaming, Flink, Storm, Heron, or a comparable stream processor
- Kafka, Kinesis, or a comparable message broker
- Airflow or a comparable scheduler
- Amazon Web Services (AWS)
We also expect candidates to:
- Communicate clearly (Slack and JIRA knowledge is a plus)
- Meet deadlines reliably
- Be willing to sign an NDA
Hours: Less than 30 hrs/week
Project length: Less than 1 month
I am willing to pay higher rates for the most experienced freelancers