forked from lukaszett/Knack-Scraper
Knack Transform
Data transformation pipeline for the Knack scraper project.
Overview
This folder contains the transformation logic that processes data from the SQLite database. It runs on a weekly schedule via cron (every Sunday at 3 AM in the Docker setup).
Structure
base.py - Abstract base class for transform nodes
main.py - Main entry point and pipeline orchestration
Dockerfile - Docker image configuration with cron setup
requirements.txt - Python dependencies
Transform Nodes
Transform nodes inherit from TransformNode and implement the run method:
import sqlite3

from base import TransformNode, TransformContext


class MyTransform(TransformNode):
    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        df = context.get_dataframe()

        # Transform logic here
        transformed_df = df.copy()
        # ... your transformations ...

        # Optionally write back to database
        transformed_df.to_sql("my_table", con, if_exists="replace", index=False)

        return TransformContext(transformed_df)
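How main.py chains nodes together is not shown in this README. The following is a minimal sketch of one way the orchestration could work, not the project's actual implementation. The TransformNode and TransformContext classes here are simplified stand-ins for the ones in base.py, and they carry plain row tuples instead of a pandas DataFrame so the sketch runs with only the standard library. StripTitles, SaveTitles, run_pipeline, and the clean_posts table are hypothetical names used for illustration.

```python
import sqlite3
from abc import ABC, abstractmethod


class TransformContext:
    """Hypothetical stand-in: carries data from one node to the next."""

    def __init__(self, data):
        self._data = data

    def get_dataframe(self):
        # The real class presumably returns a DataFrame; here it is a list of rows.
        return self._data


class TransformNode(ABC):
    """Hypothetical stand-in for the abstract base class in base.py."""

    @abstractmethod
    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        ...


class StripTitles(TransformNode):
    """Example node: trims surrounding whitespace from each title."""

    def run(self, con, context):
        rows = [(post_id, title.strip()) for post_id, title in context.get_dataframe()]
        return TransformContext(rows)


class SaveTitles(TransformNode):
    """Example node: writes the rows to a table, replacing any old copy."""

    def run(self, con, context):
        con.execute("DROP TABLE IF EXISTS clean_posts")
        con.execute("CREATE TABLE clean_posts (id INTEGER, title TEXT)")
        con.executemany("INSERT INTO clean_posts VALUES (?, ?)", context.get_dataframe())
        con.commit()
        return context


def run_pipeline(con, nodes, context):
    """Run each node in order, feeding its returned context to the next node."""
    for node in nodes:
        context = node.run(con, context)
    return context


con = sqlite3.connect(":memory:")
initial = TransformContext([(1, "  Hello "), (2, "World  ")])
result = run_pipeline(con, [StripTitles(), SaveTitles()], initial)
stored = con.execute("SELECT id, title FROM clean_posts ORDER BY id").fetchall()
```

Because every node takes a context and returns a context, nodes can be reordered or added without changing the loop in run_pipeline.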
Configuration
Copy .env.example to .env and configure:
LOGGING_LEVEL - Log level (INFO or DEBUG)
DB_PATH - Path to SQLite database
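As an illustration of how these two settings might be consumed, here is a small sketch that reads them from the process environment. It is an assumption about usage, not the code in main.py: load_config is a hypothetical helper, the fallback path /data/knack.sqlite is an invented default, and the sketch assumes the values from .env have already been exported into the environment by whatever mechanism the project uses.

```python
import logging
import os

# Only the two levels named in this README are recognised; anything else falls back to INFO.
_LEVELS = {"INFO": logging.INFO, "DEBUG": logging.DEBUG}


def load_config(env=os.environ):
    """Read LOGGING_LEVEL and DB_PATH from a mapping (defaults to os.environ)."""
    level_name = env.get("LOGGING_LEVEL", "INFO").upper()
    return {
        "logging_level": _LEVELS.get(level_name, logging.INFO),
        # Hypothetical default path, used only when DB_PATH is unset.
        "db_path": env.get("DB_PATH", "/data/knack.sqlite"),
    }


# Passing a plain dict instead of os.environ keeps the example reproducible.
config = load_config({"LOGGING_LEVEL": "debug", "DB_PATH": "/tmp/example.sqlite"})
logging.basicConfig(level=config["logging_level"])
```

Taking the environment as a parameter makes the helper easy to test without modifying the real process environment.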
Running
With Docker
docker build -t knack-transform .
docker run -v "$(pwd)/data:/data" knack-transform
Locally
python main.py
Cron Schedule
The Docker container runs the transform pipeline every Sunday at 3 AM (0 3 * * 0 in standard cron syntax).