forked from lukaszett/Knack-Scraper
Knack Transform
Data transformation pipeline for the Knack scraper project.
Overview
This folder contains the transformation logic that processes data from the SQLite database. It runs on a schedule via cron: every Sunday at 3 AM.
Structure
base.py - Abstract base class for transform nodes
main.py - Main entry point and pipeline orchestration
Dockerfile - Docker image configuration with cron setup
requirements.txt - Python dependencies
Transform Nodes
Transform nodes inherit from TransformNode and implement the run method:
from base import TransformNode, TransformContext
import sqlite3

class MyTransform(TransformNode):
    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        df = context.get_dataframe()
        # Transform logic here
        transformed_df = df.copy()
        # ... your transformations ...
        # Optionally write back to database
        transformed_df.to_sql("my_table", con, if_exists="replace", index=False)
        return TransformContext(transformed_df)
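To make the node contract easier to picture, here is a minimal, dependency-free sketch of how nodes might be chained. The two small classes are hypothetical stand-ins for what base.py provides, not its actual contents, and `run_pipeline` and `UppercaseTitles` are illustrative names only. To avoid requiring pandas, the context here carries a list of row tuples where the real pipeline passes a DataFrame.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Any, Iterable


class TransformContext:
    """Hypothetical stand-in for base.TransformContext: wraps the data handed between nodes."""

    def __init__(self, df: Any) -> None:
        self._df = df

    def get_dataframe(self) -> Any:
        return self._df


class TransformNode(ABC):
    """Hypothetical stand-in for base.TransformNode: one pipeline step."""

    @abstractmethod
    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        ...


class UppercaseTitles(TransformNode):
    """Illustrative node: upper-cases the second column of each (id, title) row."""

    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        rows = context.get_dataframe()
        transformed = [(row_id, title.upper()) for row_id, title in rows]
        # Optionally write back to the database, mirroring the example above.
        con.execute("CREATE TABLE IF NOT EXISTS my_table (id INTEGER, title TEXT)")
        con.execute("DELETE FROM my_table")
        con.executemany("INSERT INTO my_table VALUES (?, ?)", transformed)
        con.commit()
        return TransformContext(transformed)


def run_pipeline(
    con: sqlite3.Connection,
    nodes: Iterable[TransformNode],
    context: TransformContext,
) -> TransformContext:
    """Run each node in order, feeding the output context of one into the next."""
    for node in nodes:
        context = node.run(con, context)
    return context


if __name__ == "__main__":
    con = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
    start = TransformContext([(1, "first post"), (2, "second post")])
    result = run_pipeline(con, [UppercaseTitles()], start)
    print(result.get_dataframe())
    con.close()
```

Because every node takes a context and returns a context, steps can be added, removed, or reordered without changing the orchestration loop.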
Configuration
Copy .env.example to .env and configure:
LOGGING_LEVEL - Log level (INFO or DEBUG)
DB_PATH - Path to SQLite database
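As a rough sketch of how these two settings could be consumed, the snippet below reads them from the process environment using only the standard library. The function name `load_config`, the fallback to INFO, and the default path `/data/knack.sqlite` are assumptions made for illustration; the project's main.py may handle this differently. Note that this reads environment variables only: how the `.env` file gets loaded into the environment (for example `docker run --env-file` or a package such as python-dotenv) is not specified here.

```python
import logging
import os
from typing import Mapping, Optional

# Only the two levels named in the configuration section are recognised.
_LEVELS = {"INFO": logging.INFO, "DEBUG": logging.DEBUG}


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read LOGGING_LEVEL and DB_PATH from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    level_name = env.get("LOGGING_LEVEL", "INFO").upper()
    return {
        # Unknown level names fall back to INFO (an assumption of this sketch).
        "logging_level": _LEVELS.get(level_name, logging.INFO),
        # Hypothetical default path, used only when DB_PATH is unset.
        "db_path": env.get("DB_PATH", "/data/knack.sqlite"),
    }


if __name__ == "__main__":
    config = load_config()
    logging.basicConfig(level=config["logging_level"])
    logging.getLogger(__name__).info("Using database at %s", config["db_path"])
```

Passing the environment in as a mapping keeps the function easy to exercise without touching real environment variables.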
Running
With Docker
docker build -t knack-transform .
docker run -v "$(pwd)/data:/data" knack-transform
Locally
python main.py
Cron Schedule
The Docker container runs the transform pipeline every Sunday at 3 AM.
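In standard five-field cron syntax, "every Sunday at 3 AM" is written `0 3 * * 0`; whether the Dockerfile uses exactly this entry is an assumption. To check when the next run is due, the schedule can be sketched in Python as follows. The helper `next_sunday_3am` is hypothetical and not part of the project, and it works in whatever time zone the given datetime represents.

```python
from datetime import datetime, timedelta


def next_sunday_3am(now: datetime) -> datetime:
    """Return the first Sunday 03:00 strictly after `now`."""
    # datetime.weekday() counts Monday as 0 and Sunday as 6.
    days_ahead = (6 - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=3, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        # Already at or past this Sunday's 03:00, so move to next week.
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    print(next_sunday_3am(datetime.now()))
```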