Almost 4 years of experience in design and development of Big Data and Cloud technologies.
Experience in PySpark, Spark-SQL, ADF, Azure Databricks, Python, CI/CD, Azure DevOps, and Hive.
Hands-on experience in optimization and performance tuning in Spark.
Proficient in requirement gathering, solution design, and development.
Cohesive team player and fast learner with innovative, analytical, problem-solving, and good communication skills.
Big data Technologies: PySpark, Azure Databricks, Hive
Hadoop Components: PySpark, HDFS, Azure Storage
Cloud Components: Azure ADF, Azure Databricks, Azure Synapse, Azure Logic App, Azure DevOps
Languages: PySpark, Python, SQL, Unix/Linux
Database: Azure SQL Database, MySQL, Hive
Version Control: Git
IDE: Jupyter, PyCharm
Operating Systems: MS Windows, Unix/Linux
Issue Tracking Tool: Jira
Other Tools: SQL Workbench, PuTTY, MobaXterm, MS Office
Project – REKBDSOR (Retail Execution Key Business Driver System of Records)
Domain- RCM
Technology-Big Data – Azure Databricks, PySpark, Spark-SQL, Azure ADF, Azure Storage, Azure Logic App
Description-3Sixty System of Records (SOR)/REKBDSOR is an initiative of P&G to consolidate audit transactional data for all retailers across 20 countries. The countries are bucketed into two groups based on the source of the transactional data, namely SFDC and Planorama. Two different sources are used to provide master data for all countries – Global master data and Local master data. The consolidated transactional data and master data are used to generate KPIs (Measures and Golden Point) for tracking the performance of retail execution.
WORK AND RESPONSIBILITIES
Working on Azure Databricks for development activities based on business requirements from the market.
Receiving transactional data from the source systems (SFDC/Planorama) through bp2 into SOR Delta tables in structured form.
Writing code in PySpark for report generation and performing unit testing.
Deploying to the PROD environment through a CR after unit testing.
Writing data in different file formats – Parquet, Delta, or CSV (see the sketch after this list).
Creating and scheduling pipelines in ADF at the market level with scheduled or event-based triggers.
Sending reports to downstream paths (Highway) and publishing them at the PBI level.
Following agile methodologies.
End-to-end handling of projects as well as application requests.
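As an illustration of the Databricks development and file-format handling described above, a minimal PySpark sketch (table names and paths are hypothetical placeholders, not actual project values):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sor_report").getOrCreate()

    # Hypothetical SOR transactional table; real table names vary per market.
    df = spark.table("sor.transactional_audit")

    # Write the report data out in the formats used downstream.
    df.write.format("delta").mode("overwrite").save("/mnt/sor/delta/audit_report")
    df.write.mode("overwrite").parquet("/mnt/sor/parquet/audit_report")
    df.write.mode("overwrite").option("header", True).csv("/mnt/sor/csv/audit_report")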
PREVIOUS WORK EXPERIENCE
Project – AML (Anti Money Laundering)
Domain- Banking
Technology-Big Data – Hadoop (PySpark, Hive, Spark-SQL, AutoSys)
Description-A data-driven consulting project whose objective is alert generation for customers: customers who breach the threshold limits defined by the business requirements are flagged based on their transactions together with the related customer and account information.
WORK AND RESPONSIBILITIES
Writing Spark jobs to consume data from HDFS, transform it as per requirements, and ingest it into the HDFS raw zone (a minimal sketch follows this list).
Built an enterprise data lake by performing various aggregations on the HDFS raw zone data as per business requirements, providing a single abstracted view of business data to the client's business units for advanced analytics.
Scheduling batch Spark jobs via the AutoSys scheduler at a pre-defined frequency.
Worked on different compression techniques such as LZ4, bzip2, Snappy, and LZO.
Writing shell scripts for cleansing the data from sources.
Worked with structured data of 100 GB daily with a replication factor of 3.
Following agile methodologies.
End-to-end handling of projects as well as application requests.
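A minimal PySpark sketch of the kind of job described above (paths, column names, and the threshold value are hypothetical placeholders, not actual project values):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("aml_alerts").getOrCreate()

    # Hypothetical HDFS source path for transaction data.
    txns = spark.read.parquet("hdfs:///data/source/transactions")

    # Flag customers whose daily transaction total breaches a hypothetical threshold.
    alerts = (txns.groupBy("customer_id", "txn_date")
                  .agg(F.sum("amount").alias("daily_total"))
                  .filter(F.col("daily_total") > 100000))

    # Ingest the result into the raw zone with Snappy compression.
    (alerts.write.mode("overwrite")
           .option("compression", "snappy")
           .parquet("hdfs:///data/raw_zone/aml_alerts"))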
PREVIOUS WORK EXPERIENCE
Project – Health management, risk stratification and prevention
Domain- Healthcare
Technology-Big Data – Hadoop (Hive, HDFS, Sqoop, Linux)
Description-This project enables healthcare providers to improve operational effectiveness, reduce costs, reduce medical errors, and enhance the delivery of quality care. The system was developed to automate hospital operations and keep information about registered patients.
WORK AND RESPONSIBILITIES
Worked on a Hadoop cluster (CDH 5) with 5 nodes.
Worked with structured data of 10 GB with a replication factor of 3.
Extracted the data from MySQL databases into HDFS using Sqoop.
Created jobs in Sqoop with incremental load and populated Hive tables.
Wrote Hive scripts and transformed raw data from several sources.
Created both managed and external tables to optimize performance.
Solved performance issues in Hive by tuning joins and implementing partitioning, bucketing, grouping, and aggregation (see the sketch after this list).
Worked with various file formats such as ORC, Avro, and Parquet.
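A minimal sketch of the partitioning and bucketing approach described above, expressed here through PySpark's spark.sql for illustration only (the original work used Hive scripts directly; table names, columns, and paths are hypothetical placeholders):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("patient_tables")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical external table over the Sqoop-imported raw files.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS patients_raw (
            patient_id INT, name STRING, admit_date STRING, dept STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION '/data/raw/patients'
    """)

    # Curated table partitioned by department and bucketed by patient_id,
    # stored as ORC so joins and aggregations read less data.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS patients_curated (
            patient_id INT, name STRING, admit_date STRING, dept STRING)
        USING ORC
        PARTITIONED BY (dept)
        CLUSTERED BY (patient_id) INTO 8 BUCKETS
    """)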