Summary:
This capstone project was developed as part of the Data Engineering BootCamp I attended in summer 2024. It directly addresses a real-life challenge I encountered in my current job—automating the data pipeline to efficiently process and transform raw Chatfuel data. By reverse-engineering a non-standard API, simulating event-based data, and implementing robust data transformations, I built a scalable solution that delivers actionable insights daily.
Tools I leaned on: Python, Airflow, dbt, BigQuery, Looker.
During the summer 2024 Data Engineering BootCamp at DataExpert.io, I had the opportunity to tackle a challenging problem that I face in my current role. This project was not only a learning experience but also a practical solution to streamline the messy and manual process of handling raw Chatfuel data.
Key Highlights:
1. Advanced Data Extraction:
With no public documentation available, I had to reverse-engineer Chatfuel’s GraphQL API. This meant navigating a dynamic data structure in which the number of attributes could exceed 2,000 and changed constantly. The challenge pushed me to build a flexible extraction method that automatically detects and adapts to new data fields, ensuring comprehensive data capture.
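Below is a minimal sketch of that extraction pattern. The endpoint URL, query shape, and field names are illustrative assumptions (the real API is undocumented and was inferred from the dashboard's network traffic); the essential ideas are flattening the name/value attribute pairs and flagging fields that haven't been seen before.

```python
import requests

# Hypothetical endpoint and query shape; the real Chatfuel GraphQL API is
# undocumented, so both were inferred by inspecting network calls.
GRAPHQL_URL = "https://dashboard.chatfuel.com/graphql"
QUERY = """
query Users($botId: String!, $limit: Int!) {
  bot(id: $botId) {
    users(limit: $limit) {
      id
      attributes { name value }
    }
  }
}
"""

def fetch_users(session: requests.Session, bot_id: str, limit: int = 500) -> list[dict]:
    """Pull a page of users and flatten their dynamic attributes into flat dicts."""
    resp = session.post(
        GRAPHQL_URL,
        json={"query": QUERY, "variables": {"botId": bot_id, "limit": limit}},
    )
    resp.raise_for_status()
    users = resp.json()["data"]["bot"]["users"]
    # Attributes arrive as name/value pairs, so the schema can grow at any time;
    # flattening them keyed by name lets new fields flow through without code changes.
    return [
        {"user_id": u["id"], **{a["name"]: a["value"] for a in u["attributes"]}}
        for u in users
    ]

def detect_new_fields(rows: list[dict], known_fields: set[str]) -> set[str]:
    """Report attribute names not seen before, so downstream schemas can be extended."""
    seen: set[str] = set()
    for row in rows:
        seen.update(row.keys())
    return seen - known_fields
```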
2. Simulating Event-Based Data:
Chatfuel’s architecture doesn’t natively support event tracking. To overcome this, I took periodic snapshots of user data, so that each pull serves as a historical record and, taken together, the snapshots simulate an event-based system. This workaround enables detailed analysis of user behavior over time, including trend analysis that was previously unachievable with the native data.
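Conceptually, each pull is stamped with its snapshot time and appended to a date-partitioned raw table, so consecutive snapshots can later be diffed into pseudo-events. The sketch below assumes BigQuery as the landing zone; the table and column names are hypothetical.

```python
from datetime import datetime, timezone

import pandas as pd
from google.cloud import bigquery

def snapshot_users(rows: list[dict], table: str = "raw.chatfuel_user_snapshots") -> None:
    """Stamp a pull with its snapshot time and append it to a partitioned table.

    Each daily pull becomes an immutable historical record; comparing consecutive
    snapshots downstream yields pseudo-events (e.g. a user changed status).
    """
    df = pd.DataFrame(rows)
    df["snapshot_ts"] = datetime.now(timezone.utc)

    client = bigquery.Client()  # dataset and table names here are illustrative
    job = client.load_table_from_dataframe(
        df,
        table,
        job_config=bigquery.LoadJobConfig(
            write_disposition="WRITE_APPEND",
            time_partitioning=bigquery.TimePartitioning(field="snapshot_ts"),
        ),
    )
    job.result()
```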
3. Robust Transformation Pipeline with dbt and BigQuery:
Once the raw data was ingested, I used dbt on BigQuery to perform incremental transformations. This setup not only ensured that new or updated data was seamlessly integrated but also maintained a complete historical record. I implemented merge strategies and post-hooks to keep the data consistent and up to date.
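For illustration, the sketch below spells out the kind of upsert an incremental model performs on each run, expressed through the BigQuery Python client rather than the actual dbt model; dataset, table, and column names are hypothetical. In the real pipeline, dbt generates an equivalent MERGE from the model's incremental configuration.

```python
from google.cloud import bigquery

# Illustrative upsert similar to what an incremental dbt model emits on each run:
# new or changed snapshot rows are merged into the historical table so reruns
# stay idempotent. All dataset, table, and column names are hypothetical.
MERGE_SQL = """
MERGE `analytics.dim_chatfuel_users` AS target
USING (
  SELECT user_id, snapshot_ts, status, last_seen_block
  FROM `raw.chatfuel_user_snapshots`
  WHERE DATE(snapshot_ts) = CURRENT_DATE()
) AS source
ON target.user_id = source.user_id AND target.snapshot_ts = source.snapshot_ts
WHEN MATCHED THEN
  UPDATE SET status = source.status, last_seen_block = source.last_seen_block
WHEN NOT MATCHED THEN
  INSERT (user_id, snapshot_ts, status, last_seen_block)
  VALUES (source.user_id, source.snapshot_ts, source.status, source.last_seen_block)
"""

def run_incremental_merge() -> None:
    """Execute the merge directly; dbt handles this step in the actual pipeline."""
    client = bigquery.Client()
    client.query(MERGE_SQL).result()
```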
4. Automation and Orchestration with Airflow:
To tie everything together, I used Airflow to orchestrate the entire pipeline. This automation covers data collection, transformation, and even error handling. The system is scheduled to run daily, providing stakeholders with timely, actionable insights through a Looker dashboard that visualizes key performance indicators and trends.
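In outline, the DAG looks roughly like the sketch below, assuming a recent Airflow 2.x release; the task ids, dbt selector, and retry policy are placeholders standing in for the real pipeline.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Task names and callables are illustrative; the real DAG wires in the
# extraction and snapshot steps sketched above.
default_args = {
    "owner": "data-eng",
    "retries": 2,                       # basic error handling: retry transient failures
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="chatfuel_daily_pipeline",
    start_date=datetime(2024, 6, 1),
    schedule="@daily",                  # daily refresh feeding the Looker dashboard
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(
        task_id="extract_chatfuel_snapshot",
        python_callable=lambda: None,   # placeholder for the extraction + load step
    )
    transform = BashOperator(
        task_id="run_dbt_models",
        bash_command="dbt run --select chatfuel --target prod",
    )
    extract >> transform
```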
5. Real-World Application:
At my current job, I encounter operational challenges with manual data handling and outdated systems. This project directly addresses these issues by automating complex data workflows, reducing manual intervention, and enabling daily analysis. The insights generated from this pipeline empower decision-makers to act quickly and confidently.
For a more detailed description of the project’s architecture, transformation steps, and the code itself, you can dive into the repository on GitHub.