Knack-Scraper/transform
2025-12-21 21:18:05 +01:00
.env.example Implements Feature to cleanup authors freetext field 2025-12-21 21:18:05 +01:00
author_node.py Implements Feature to cleanup authors freetext field 2025-12-21 21:18:05 +01:00
base.py Implements Feature to cleanup authors freetext field 2025-12-21 21:18:05 +01:00
Dockerfile Implements Feature to cleanup authors freetext field 2025-12-21 21:18:05 +01:00
main.py Implements Feature to cleanup authors freetext field 2025-12-21 21:18:05 +01:00
README.md Implements Feature to cleanup authors freetext field 2025-12-21 21:18:05 +01:00
requirements.txt Implements Feature to cleanup authors freetext field 2025-12-21 21:18:05 +01:00

Knack Transform

Data transformation pipeline for the Knack scraper project.

Overview

This folder contains the transformation logic that processes data from the SQLite database. It runs weekly via cron, every Sunday at 3 AM.

Structure

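  • base.py - Abstract base class for transform nodes
  • author_node.py - Transform node that cleans up the authors free-text field
  • main.py - Main entry point and pipeline orchestration
  • Dockerfile - Docker image configuration with cron setup
  • requirements.txt - Python dependencies
  • .env.example - Template for the environment configuration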

Transform Nodes

Transform nodes inherit from TransformNode and implement the run method:

from base import TransformNode, TransformContext
import sqlite3

class MyTransform(TransformNode):
    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        df = context.get_dataframe()
        
        # Transform logic here
        transformed_df = df.copy()
        # ... your transformations ...
        
        # Optionally write back to database
        transformed_df.to_sql("my_table", con, if_exists="replace", index=False)
        
        return TransformContext(transformed_df)
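
The example above relies on two names from base.py whose definitions are not shown in this README. The following is a minimal sketch of what they might look like, assuming TransformContext simply wraps a pandas DataFrame and that nodes run one after another. The UppercaseTitles node and the run_pipeline helper are hypothetical and only illustrate how contexts are chained; the real base.py and main.py may differ.

```python
import sqlite3
from abc import ABC, abstractmethod

import pandas as pd


class TransformContext:
    """Carries the DataFrame handed from one node to the next (assumed shape)."""

    def __init__(self, df: pd.DataFrame):
        self._df = df

    def get_dataframe(self) -> pd.DataFrame:
        return self._df


class TransformNode(ABC):
    """Abstract base class: a node receives a context and returns a new one."""

    @abstractmethod
    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        ...


class UppercaseTitles(TransformNode):
    """Hypothetical node, used only for illustration."""

    def run(self, con, context):
        # Work on a copy so the incoming context stays untouched.
        df = context.get_dataframe().copy()
        df["title"] = df["title"].str.upper()
        return TransformContext(df)


def run_pipeline(con, nodes, context):
    """Hypothetical helper: feed each node's output into the next node."""
    for node in nodes:
        context = node.run(con, context)
    return context


con = sqlite3.connect(":memory:")
start = TransformContext(pd.DataFrame({"title": ["hello", "world"]}))
result = run_pipeline(con, [UppercaseTitles()], start)
con.close()
```

Returning a fresh TransformContext from each node, rather than mutating the incoming one, keeps every step's input available for inspection while debugging.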

Configuration

Copy .env.example to .env and configure:

  • LOGGING_LEVEL - Log level (INFO or DEBUG)
  • DB_PATH - Path to SQLite database
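
How main.py reads these two variables is not shown here. As a rough sketch, they could be picked up from the process environment as below. The load_config function, the fallback to INFO, and the default database path are all assumptions made for illustration. The sketch also assumes the values from .env have already been exported into the environment, for example by Docker or a dotenv loader.

```python
import logging
import os
import sqlite3

# Only the two levels named in the README are accepted.
LEVELS = {"INFO": logging.INFO, "DEBUG": logging.DEBUG}


def load_config(env=os.environ):
    """Return (log level, database path); the defaults here are assumptions."""
    level_name = env.get("LOGGING_LEVEL", "INFO").upper()
    level = LEVELS.get(level_name, logging.INFO)  # unknown names fall back to INFO
    db_path = env.get("DB_PATH", "data/knack.sqlite")  # hypothetical default path
    return level, db_path


# Passing a plain dict instead of os.environ makes the function easy to try out.
level, db_path = load_config({"LOGGING_LEVEL": "debug", "DB_PATH": ":memory:"})
logging.basicConfig(level=level)
con = sqlite3.connect(db_path)
con.close()
```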

Running

With Docker

docker build -t knack-transform .
docker run -v "$(pwd)/data:/data" knack-transform

Locally

python main.py

Cron Schedule

The Docker container runs the transform pipeline every Sunday at 3 AM.
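
In standard five-field cron syntax, "every Sunday at 3 AM" is written as 0 3 * * 0. The actual crontab entry lives in the Dockerfile and is not reproduced here, so treat that expression as an assumption. The small sketch below uses a hypothetical next_run helper to work out when such a schedule fires next, relative to a given time in the container's local clock.

```python
from datetime import datetime, timedelta

# minute hour day-of-month month day-of-week (0 = Sunday); assumed, not copied from the Dockerfile
CRON_EXPRESSION = "0 3 * * 0"


def next_run(now):
    """Return the next Sunday 03:00 strictly after `now`."""
    candidate = now.replace(hour=3, minute=0, second=0, microsecond=0)
    # datetime.weekday(): Monday is 0, Sunday is 6.
    days_ahead = (6 - candidate.weekday()) % 7
    candidate += timedelta(days=days_ahead)
    if candidate <= now:
        # Already past this week's slot, so move to the following Sunday.
        candidate += timedelta(days=7)
    return candidate


# A Sunday evening: the next run is one week later.
upcoming = next_run(datetime(2025, 12, 21, 21, 18, 5))
```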