# Knack Transform

Data transformation pipeline for the Knack scraper project.

## Overview

This folder contains the transformation logic that processes data from the SQLite database. It runs once a week via cron (Sunday at 3 AM).

The pipeline supports **parallel execution** of independent transform nodes, so transforms that do not depend on each other can run concurrently on multi-core machines.

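As a rough illustration of what running independent nodes in parallel can look like, here is a minimal sketch using the standard library. The helper `run_parallel` and the node names are hypothetical; the actual orchestration lives in `pipeline.py` and may work differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_parallel(nodes: Dict[str, Callable[[], object]], max_workers: int = 4) -> Dict[str, object]:
    """Run independent callables concurrently and collect results by name.

    Hypothetical helper, not the API of pipeline.py.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the callables can overlap
        futures = {name: pool.submit(fn) for name, fn in nodes.items()}
        # result() blocks until done and re-raises any worker exception
        return {name: fut.result() for name, fut in futures.items()}


# Two stand-in "nodes" with no dependency on each other
results = run_parallel({
    "authors": lambda: sum(range(10)),
    "topics": lambda: len("knack"),
})
```

Threads are used here to keep the sketch short. For CPU-bound transforms, `ProcessPoolExecutor` offers the same `submit` interface but needs picklable, top-level functions. Note also that a `sqlite3` connection is by default usable only in the thread that created it, so parallel workers would typically open their own connections.
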
## Structure

- `base.py` - Abstract base class for transform nodes
- `pipeline.py` - Parallel pipeline orchestration system
- `main.py` - Main entry point and pipeline execution
- `author_node.py` - NER-based author classification node
- `example_node.py` - Template for creating new nodes
- `Dockerfile` - Docker image configuration with cron setup
- `requirements.txt` - Python dependencies

## Transform Nodes

Transform nodes inherit from `TransformNode` and implement the `run` method:

```python
import sqlite3

from base import TransformNode, TransformContext


class MyTransform(TransformNode):
    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        df = context.get_dataframe()

        # Transform logic here; work on a copy so the input stays untouched
        transformed_df = df.copy()
        # ... your transformations ...

        # Optionally write back to database
        transformed_df.to_sql("my_table", con, if_exists="replace", index=False)

        # Return the result wrapped in a new context
        return TransformContext(transformed_df)
```

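To try the node contract without the rest of the project, the following self-contained sketch may help. The `TransformContext` and `TransformNode` classes below are simplified stand-ins, not the real contents of `base.py`, and `UppercaseTitles` with its `titles` table is a hypothetical node. Rows are plain dicts rather than a DataFrame so the example needs only the standard library.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Any


class TransformContext:
    """Stand-in: wraps the data passed between nodes (a DataFrame in the project)."""

    def __init__(self, df: Any = None):
        self._df = df

    def get_dataframe(self) -> Any:
        return self._df


class TransformNode(ABC):
    """Stand-in: every node must implement run()."""

    @abstractmethod
    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        ...


class UppercaseTitles(TransformNode):
    """Hypothetical node that upper-cases a 'title' field and stores it."""

    def run(self, con, context):
        rows = [{**row, "title": row["title"].upper()} for row in context.get_dataframe()]
        con.execute("CREATE TABLE IF NOT EXISTS titles (title TEXT)")
        con.executemany("INSERT INTO titles (title) VALUES (:title)", rows)
        return TransformContext(rows)


# An in-memory database keeps the example free of side effects
con = sqlite3.connect(":memory:")
result = UppercaseTitles().run(con, TransformContext([{"title": "hello"}]))
stored = con.execute("SELECT title FROM titles").fetchall()
```

Running a node against `:memory:` like this is one way to unit-test transform logic before wiring it into the pipeline.
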
## Configuration

Copy `.env.example` to `.env` and set the following variables:

- `LOGGING_LEVEL` - Log level (`INFO` or `DEBUG`)
- `DB_PATH` - Path to the SQLite database

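One possible way to consume these two variables from Python is sketched below. The function `load_config`, the fallback to `INFO`, and the default database path are assumptions for illustration; how `main.py` actually reads its settings is not shown here.

```python
import logging
import os


def load_config(env=None):
    """Read the two documented settings; the defaults are assumptions."""
    env = os.environ if env is None else env
    level_name = env.get("LOGGING_LEVEL", "INFO").upper()
    # Fall back to INFO for anything other than the two documented levels
    level = logging.DEBUG if level_name == "DEBUG" else logging.INFO
    db_path = env.get("DB_PATH", "/data/knack.sqlite")  # hypothetical default path
    return level, db_path


# Passing a dict instead of os.environ makes the function easy to test
level, db_path = load_config({"LOGGING_LEVEL": "debug", "DB_PATH": "/tmp/test.sqlite"})
logging.basicConfig(level=level)
```

Python does not read `.env` files on its own. The values have to reach the process environment some other way, for example through `docker run --env-file .env` or a loader library.
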
## Running

### With Docker

```bash
docker build -t knack-transform .
# Quote the volume argument so paths containing spaces still work
docker run -v "$(pwd)/data:/data" knack-transform
```

### Locally

```bash
# Install the dependencies once, then start the pipeline
pip install -r requirements.txt
python main.py
```

## Cron Schedule

The Docker container runs the transform pipeline every Sunday at 3 AM.
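
In standard cron syntax, "every Sunday at 3 AM" is written `0 3 * * 0`. The exact crontab entry in the `Dockerfile` is not reproduced here, so treat that expression as an assumption. The hypothetical helper below computes when the next run would fall, which can be handy when checking whether fresh data should already be present.

```python
from datetime import datetime, timedelta

# minute hour day-of-month month day-of-week (0 = Sunday); assumed, not copied from the Dockerfile
CRON_EXPRESSION = "0 3 * * 0"


def next_run(now: datetime) -> datetime:
    """Return the next Sunday 03:00 strictly after `now` (naive local time)."""
    # datetime.weekday(): Monday == 0 ... Sunday == 6
    days_ahead = (6 - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=3, minute=0, second=0, microsecond=0
    )
    # If it is already Sunday 03:00 or later, move to the following week
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```

The times are naive, so they follow whatever time zone the container is configured with.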