Lead and coordinate project team members in project delivery
Participate in technical sales processes and in delivery process design and standardization
Gather user requirements and construct UML diagrams for requirements modelling
Gather and process raw data at scale (including writing scripts, web scraping, calling
APIs, writing SQL queries, etc.)
Design, develop & document data pipeline and analysis programs using Hadoop and related
ecosystem tools such as Hive & Spark
Design, develop & document predictive models utilizing tools included in Hadoop cluster
such as Spark MLlib
Design, develop & document data ingestion, data pre-processing, data cleansing and data
standardization rules to prepare datasets for analysis, and ensure these processes are
executed in an optimized and timely manner
Design, develop & document methods to transform unstructured datasets such as text, audio
and video into structured attributes
Design, develop & document data processing workflows and governance rules using Python,
Airflow and Ranger
Design, develop & document RESTful web APIs and web applications in Python to productize
data pipelines and processing workflows
Conduct requirement gathering to understand customer needs and the as-is data ecosystem
Work with subject matter experts to translate domain knowledge into data processing
pipelines and data products
Design, develop & document data products such as web-based visualization dashboards and
data collection applications/services
Design, deploy, manage & document data processing infrastructure both on-site and in the cloud
Design and develop automated unit test scripts for developed software
5+ years’ experience in software development projects or ETL /
data warehousing / master data management projects
Experience in the system development lifecycle, either professionally or as a hobby
Programming knowledge to clean and scrub noisy datasets
Self-driven and able to take own initiative to learn and explore
Capable of picking up new technologies and practices in a rapid manner
Solid foundation in mathematical and algorithmic thinking
A strong background and experience in statistics is a plus
Background in UML modeling is a plus
Coaching and self-paced training materials will be provided.
Join a high energy team, which includes several Open Source contributors working towards
transforming the local IT industry through Open Source technologies.
Opportunity to work with multiple high-demand Open Source technologies in the Big Data
market, including the following technologies' ecosystems: Hadoop, Hive, Spark,
Python (Morepath, Flask, Celery, Scikit, Superset, PySpark, Buildbot, SQLAlchemy,
Py.test), PostgreSQL, Druid, HBase, RabbitMQ, Ansible, Docker, Kubernetes, HAProxy,
Please fill in the form below. If you're shortlisted, we will get in touch.