# Knack Transform

Data transformation pipeline for the Knack scraper project.

## Overview

This folder contains the transformation logic that processes data from the SQLite database. It runs on a scheduled basis (every weekend) via cron.

## Structure

- `base.py` - Abstract base class for transform nodes
- `main.py` - Main entry point and pipeline orchestration
- `Dockerfile` - Docker image configuration with cron setup
- `requirements.txt` - Python dependencies

## Transform Nodes

Transform nodes inherit from `TransformNode` and implement the `run` method:

```python
from base import TransformNode, TransformContext
import sqlite3

class MyTransform(TransformNode):
    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        df = context.get_dataframe()

        # Transform logic here
        transformed_df = df.copy()
        # ... your transformations ...

        # Optionally write back to database
        transformed_df.to_sql("my_table", con, if_exists="replace", index=False)

        return TransformContext(transformed_df)
```

## Configuration

Copy `.env.example` to `.env` and configure:

- `LOGGING_LEVEL` - Log level (INFO or DEBUG)
- `DB_PATH` - Path to SQLite database

## Running

### With Docker

```bash
docker build -t knack-transform .
docker run -v $(pwd)/data:/data knack-transform
```

### Locally

```bash
python main.py
```

## Cron Schedule

The Docker container runs the transform pipeline every Sunday at 3 AM.
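
In standard five-field cron syntax, "every Sunday at 3 AM" is written `0 3 * * 0`. The entry below is a sketch of what such a crontab line could look like, not a copy of what the `Dockerfile` installs: the `/app` working directory, the interpreter path, and the log file location are all assumptions.

```cron
# minute hour day-of-month month day-of-week  command
# 0 3 * * 0 = minute 0, hour 3, any day of month, any month, Sunday (0)
# Assumed paths: /app, /usr/local/bin/python, /var/log/knack-transform.log
0 3 * * 0 cd /app && /usr/local/bin/python main.py >> /var/log/knack-transform.log 2>&1
```

Cron evaluates the schedule in the container's local time zone, so the actual run time depends on how the image's time zone is set.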