Scalable - easy to ingest new data sources and to add new ETL jobs
Maintainable - easy to iterate on jobs and to update configurations
Reliable - data quality checks, plus alerting and monitoring
Architecture
Use Airflow to orchestrate the pipeline (see the DAG sketch after this list)
Either deploy our own Spark (for the MapReduce-style processing) on YARN (e.g., using AWS EMR), or choose a managed Spark service such as Azure HDInsight or Databricks
Our end tables can be stored in a managed NoSQL store, which handles big data storage better than a traditional RDBMS; alternatively, we can use Hive over HDFS to build our data warehouse
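A minimal orchestration sketch, assuming Airflow 2.x; the DAG and task names (layered_etl_pipeline, ingest_raw_data, transform_to_intermediate, build_target_tables) are hypothetical, and the callables are placeholders for the real ingestion, transformation, and modeling steps:

```python
# Hypothetical Airflow DAG sketch: ingest -> transform -> model, run daily.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_raw_data(**context):
    # Placeholder: pull batch sources (API / S3) into the raw layer.
    pass

def transform_to_intermediate(**context):
    # Placeholder: submit the Spark transformation job.
    pass

def build_target_tables(**context):
    # Placeholder: join intermediate tables into the target models.
    pass

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="layered_etl_pipeline",     # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_raw_data", python_callable=ingest_raw_data)
    transform = PythonOperator(task_id="transform_to_intermediate", python_callable=transform_to_intermediate)
    model = PythonOperator(task_id="build_target_tables", python_callable=build_target_tables)

    ingest >> transform >> model
```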
Layered Data Pipeline
Data Source
Batch - identify possible APIs or data source locations, like an S3 bucket, Google Cloud Storage, etc.
Stream - set up Kafka, potentially using a managed cloud service like AWS MSK
Data Ingestion - Raw Data
Use the Python Requests library to pull the source data, then perform basic cleanup (like data type fixes) on a pandas DataFrame; see the batch ingestion sketch after this list
Spark Streaming API - scalable, high-throughput, and fault-tolerant by nature of being a distributed computing system; it integrates easily with our Kafka data source (see the streaming sketch after this list)
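A minimal batch-ingestion sketch; the endpoint URL and the column names (id, revenue, created_at) are assumptions for illustration:

```python
# Hypothetical batch ingestion sketch: pull JSON from an API and do basic
# type cleanup with pandas before landing it in the raw layer.
import pandas as pd
import requests

SOURCE_URL = "https://example.com/api/transactions"   # placeholder endpoint

def ingest_batch(url: str = SOURCE_URL) -> pd.DataFrame:
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    df = pd.DataFrame(response.json())

    # Basic cleanup: enforce data types before writing the raw extract.
    df["id"] = df["id"].astype("int64")
    df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
    df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
    return df

if __name__ == "__main__":
    raw = ingest_batch()
    # Land the raw extract, e.g. as Parquet in object storage.
    raw.to_parquet("raw/transactions.parquet", index=False)
```

And a minimal streaming-ingestion sketch using Spark Structured Streaming's Kafka source; the broker addresses, topic, and storage paths are placeholders:

```python
# Hypothetical streaming ingestion sketch: read from Kafka (e.g. an AWS MSK
# cluster) and land raw events in the data lake.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("raw-ingestion-stream")   # hypothetical app name
    .getOrCreate()
)

raw_events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder brokers
    .option("subscribe", "transactions")                 # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    # Kafka payloads arrive as bytes; cast to string for downstream parsing.
    .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))
)

query = (
    raw_events.writeStream
    .format("parquet")
    .option("path", "s3a://data-lake/raw/transactions/")                 # placeholder path
    .option("checkpointLocation", "s3a://data-lake/checkpoints/raw-transactions/")
    .start()
)

query.awaitTermination()
```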
Transformation
Use Spark for the transformation; the programming language can be (but is not limited to) Scala, Python, or Java
If we decide to use Airflow's Kubernetes operator, we can run our image in any environment
Most of the MapReduce-style processing happens here
Based on the business need, we should aim for the most reusable intermediate tables (see the sketch after this list)
Data modeling takes place here, for example:
User Profile table
Marketing table
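A minimal PySpark sketch of building one reusable intermediate table (a user profile aggregate); the source path, output table name, and column names are assumptions:

```python
# Hypothetical transformation sketch: build a reusable intermediate
# "user profile" table from raw transaction events.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("intermediate-transform").getOrCreate()

raw_transactions = spark.read.parquet("s3a://data-lake/raw/transactions/")   # placeholder path

user_profile = (
    raw_transactions
    .groupBy("user_id")
    .agg(
        F.countDistinct("transaction_id").alias("transaction_count"),
        F.sum("revenue").alias("lifetime_revenue"),
        F.max("created_at").alias("last_seen_at"),
    )
)

# Persist as a reusable intermediate table (e.g. Hive over HDFS, or the lake).
user_profile.write.mode("overwrite").saveAsTable("intermediate.user_profile")
```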
Target Data Modeling
If we model our intermediate tables well, this step should be relatively easy: we can grab the necessary fields and dimensions from the intermediate tables and join them together (see the sketch below). If we want to perform advanced data analysis or ML, we can use the Spark ML package or set up a connection between HDFS and TensorFlow.
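A minimal sketch of that join, assuming the hypothetical intermediate tables from the previous step (intermediate.user_profile, intermediate.marketing) and illustrative column names:

```python
# Hypothetical target-model sketch: join intermediate tables into a
# marketing-facing target table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("target-modeling").getOrCreate()

user_profile = spark.table("intermediate.user_profile")
marketing = spark.table("intermediate.marketing")

target = (
    user_profile
    .join(marketing, on="user_id", how="left")
    .select("user_id", "lifetime_revenue", "transaction_count", "campaign", "channel")
)

target.write.mode("overwrite").saveAsTable("target.user_marketing_summary")
```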
Some possible use cases for the end result
Export our data to TensorFlow for ML
Export data to a BI tool, e.g., Looker (this requires setting up a connection and defining views and models) or Tableau (connection plus a Tableau extract refresh)
In-house data delivery system - a daily email, or a summary chart / spreadsheet posted to a Slack channel (see the sketch below)
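A minimal sketch of the Slack delivery path, assuming a Slack incoming webhook; the webhook URL and the summary table/query are placeholders:

```python
# Hypothetical delivery sketch: post a daily summary to a Slack channel.
import requests
from pyspark.sql import SparkSession

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

spark = SparkSession.builder.appName("daily-summary").getOrCreate()

# Placeholder summary query over the hypothetical target table.
summary = spark.table("target.user_marketing_summary").groupBy("channel").count().collect()

lines = [f"{row['channel']}: {row['count']} users" for row in summary]
requests.post(SLACK_WEBHOOK_URL, json={"text": "Daily summary\n" + "\n".join(lines)}, timeout=10)
```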
Data Quality Check
After each step (data ingestion, data transformation), we should define some metrics for data quality checks, for example (see the sketch after this list):
Id cannot be null
Revenue must be positive
Transaction count falls within 1.5 standard deviations of the historical mean
Country code should be an ISO 3166-1 alpha-2 code
Stock ticker should be a string, upper case
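A minimal sketch of these checks on a Spark DataFrame; the table name, column names, and historical statistics are assumptions:

```python
# Hypothetical data quality sketch implementing the rules listed above.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def run_quality_checks(df: DataFrame, hist_mean: float, hist_std: float) -> dict:
    total = df.count()
    failures = {
        # Id cannot be null
        "null_id": df.filter(F.col("id").isNull()).count(),
        # Revenue must be positive
        "non_positive_revenue": df.filter(F.col("revenue") <= 0).count(),
        # Country code should be an ISO alpha-2 code
        "bad_country_code": df.filter(~F.col("country_code").rlike("^[A-Z]{2}$")).count(),
        # Stock ticker should be upper case
        "bad_ticker": df.filter(F.col("ticker") != F.upper(F.col("ticker"))).count(),
    }
    # Transaction count should fall within 1.5 std of the historical mean
    failures["transaction_count_drift"] = int(abs(total - hist_mean) > 1.5 * hist_std)
    return failures

if __name__ == "__main__":
    spark = SparkSession.builder.appName("quality-checks").getOrCreate()
    df = spark.table("intermediate.transactions")   # placeholder table
    results = run_quality_checks(df, hist_mean=100_000, hist_std=5_000)  # placeholder stats
    if any(results.values()):
        raise ValueError(f"Data quality checks failed: {results}")
```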
Alerting system
We should integrate this with our Airflow deployment
Airflow manages the dependencies and the SLA for each task (see the sketch after this list)
Airflow should also ship its logs to Elasticsearch (a full-text search index), so we can quickly identify issues with Kibana
There are many alerting options on the market - e.g., PagerDuty, where we keep an on-call schedule and higher-priority failures are handled immediately
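A minimal sketch of wiring per-task SLAs and a failure callback into Airflow's default_args; the alert webhook endpoint is a placeholder for whichever paging integration (e.g., PagerDuty or Slack) we choose:

```python
# Hypothetical alerting sketch: per-task SLAs plus a failure callback that
# forwards task context to the paging/alerting system.
from datetime import timedelta

import requests

ALERT_WEBHOOK_URL = "https://events.example.com/alerts"   # placeholder endpoint

def notify_on_failure(context):
    # Called by Airflow when a task fails; forward key details for paging.
    payload = {
        "dag_id": context["dag"].dag_id,
        "task_id": context["task_instance"].task_id,
        "execution_date": str(context["execution_date"]),
        "log_url": context["task_instance"].log_url,
    }
    requests.post(ALERT_WEBHOOK_URL, json=payload, timeout=10)

default_args = {
    "sla": timedelta(hours=1),                 # per-task SLA, surfaced as SLA misses
    "on_failure_callback": notify_on_failure,  # page/notify immediately on failure
    "retries": 1,
}
```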