# Knack Transform
Data transformation pipeline for the Knack scraper project.
## Overview
This folder contains the transformation logic that processes data from the SQLite database. It runs on a schedule (every Sunday at 3 AM) via cron.
The pipeline supports **parallel execution** of independent transform nodes, allowing you to leverage multi-core processors for faster data transformation.
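The idea behind running independent nodes in parallel can be sketched with the standard library's `concurrent.futures`. This is a minimal illustration only, not the project's actual `pipeline.py`; the node functions below are hypothetical stand-ins for real transform nodes.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical stand-ins for two transform nodes that do not
# depend on each other's output.
def node_a(x):
    return x * 2

def node_b(x):
    return x + 10

if __name__ == "__main__":
    # Nodes with no mutual dependencies can be submitted at once,
    # so each runs in its own worker process (one per CPU core).
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(node_a, 5), pool.submit(node_b, 5)]
        results = [f.result() for f in futures]
    print(results)  # [10, 15]
```

Nodes that depend on another node's output would instead be submitted only after their dependency's future has resolved.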
## Structure
- `base.py` - Abstract base class for transform nodes
- `pipeline.py` - Parallel pipeline orchestration system
- `main.py` - Main entry point and pipeline execution
- `author_node.py` - NER-based author classification node
- `example_node.py` - Template for creating new nodes
- `Dockerfile` - Docker image configuration with cron setup
- `requirements.txt` - Python dependencies
## Transform Nodes
Transform nodes inherit from `TransformNode` and implement the `run` method:
```python
from base import TransformNode, TransformContext
import sqlite3

class MyTransform(TransformNode):
    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        df = context.get_dataframe()

        # Transform logic here
        transformed_df = df.copy()
        # ... your transformations ...

        # Optionally write back to the database
        transformed_df.to_sql("my_table", con, if_exists="replace", index=False)

        return TransformContext(transformed_df)
```
## Configuration
Copy `.env.example` to `.env` and configure:
- `LOGGING_LEVEL` - Log level (INFO or DEBUG)
- `DB_PATH` - Path to SQLite database
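A sketch of how these variables might be consumed at startup (the variable names come from `.env.example`; the default values and the use of `os.environ` here are assumptions, not the project's actual loading code):

```python
import logging
import os

# Read configuration from the environment; fall back to defaults
# when a variable is unset. The default DB path is hypothetical.
logging_level = os.environ.get("LOGGING_LEVEL", "INFO")
db_path = os.environ.get("DB_PATH", "/data/knack.db")

# Map the level name (e.g. "INFO" or "DEBUG") onto the logging module's constants.
logging.basicConfig(level=getattr(logging, logging_level))
```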
## Running
### With Docker
```bash
docker build -t knack-transform .
docker run -v $(pwd)/data:/data knack-transform
```
### Locally
```bash
python main.py
```
## Cron Schedule
The Docker container runs the transform pipeline every Sunday at 3 AM.
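In crontab syntax, that schedule corresponds to an entry like the following (the command and path inside the container are assumptions for illustration):

```
# minute hour day-of-month month day-of-week (0 = Sunday)
0 3 * * 0 python /app/main.py
```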