forked from lukaszett/Knack-Scraper
Implements Feature to cleanup authors freetext field
parent bcd210ce01
commit 64df8fb328
14 changed files with 804 additions and 310 deletions
62
transform/README.md
Normal file
@@ -0,0 +1,62 @@
# Knack Transform

Data transformation pipeline for the Knack scraper project.

## Overview

This folder contains the transformation logic that processes data from the SQLite database. It runs on a weekly schedule via cron (every Sunday at 3 AM; see Cron Schedule).

## Structure

- `base.py` - Abstract base class for transform nodes
- `main.py` - Main entry point and pipeline orchestration
- `Dockerfile` - Docker image configuration with cron setup
- `requirements.txt` - Python dependencies

## Transform Nodes

Transform nodes inherit from `TransformNode` and implement the `run` method:

```python
from base import TransformNode, TransformContext
import sqlite3

class MyTransform(TransformNode):
    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        df = context.get_dataframe()

        # Transform logic here
        transformed_df = df.copy()
        # ... your transformations ...

        # Optionally write back to database
        transformed_df.to_sql("my_table", con, if_exists="replace", index=False)

        return TransformContext(transformed_df)
```
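
The example above imports `TransformNode` and `TransformContext` from `base.py`, which this README does not show. The following is a minimal sketch of what such a base module and a simple pipeline runner might look like. Everything here is an assumption made for illustration: the class bodies, the `run_pipeline` helper, and the toy `UppercaseNames` node are hypothetical, and a plain list stands in for the pandas DataFrame so the sketch has no third-party dependencies.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Any, Iterable


class TransformContext:
    """Hypothetical stand-in: carries the data handed from one node to the next."""

    def __init__(self, df: Any) -> None:
        self._df = df

    def get_dataframe(self) -> Any:
        return self._df


class TransformNode(ABC):
    """Hypothetical stand-in for the abstract base class in base.py."""

    @abstractmethod
    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        """Transform the incoming context and return a new one."""


def run_pipeline(
    nodes: Iterable[TransformNode],
    con: sqlite3.Connection,
    context: TransformContext,
) -> TransformContext:
    """Run each node in order, feeding its output to the next (assumed behaviour)."""
    for node in nodes:
        context = node.run(con, context)
    return context


class UppercaseNames(TransformNode):
    """Toy node for illustration; operates on a list of strings, not a DataFrame."""

    def run(self, con: sqlite3.Connection, context: TransformContext) -> TransformContext:
        return TransformContext([name.upper() for name in context.get_dataframe()])
```

With these definitions, `run_pipeline([UppercaseNames()], sqlite3.connect(":memory:"), TransformContext(["ada"]))` returns a context whose data is the upper-cased list. Making `run` abstract means a node that forgets to implement it fails at instantiation rather than midway through a scheduled run.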

## Configuration

Copy `.env.example` to `.env` and configure:

- `LOGGING_LEVEL` - Log level (INFO or DEBUG)
- `DB_PATH` - Path to SQLite database
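
As an illustration of how these two settings could be consumed, here is a small sketch that reads them from the process environment. It assumes the variables have already been exported (for example by a dotenv loader or Docker's `--env-file` option); the `load_config` helper and its return shape are hypothetical and not taken from `main.py`.

```python
import logging
import os
from typing import Any, Dict, Mapping


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Read LOGGING_LEVEL and DB_PATH from a mapping of environment variables.

    Hypothetical helper: defaulting to INFO and requiring DB_PATH are
    assumptions made for this sketch, not documented project behaviour.
    """
    level_name = env.get("LOGGING_LEVEL", "INFO").upper()
    if level_name not in ("INFO", "DEBUG"):
        raise ValueError(f"LOGGING_LEVEL must be INFO or DEBUG, got {level_name!r}")

    if "DB_PATH" not in env:
        raise KeyError("DB_PATH is not set")

    return {
        "logging_level": getattr(logging, level_name),  # e.g. logging.DEBUG
        "db_path": env["DB_PATH"],
    }
```

A caller could then pass `config["logging_level"]` to `logging.basicConfig(level=...)` and `config["db_path"]` to `sqlite3.connect(...)`.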

## Running

### With Docker

```bash
docker build -t knack-transform .
docker run -v "$(pwd)/data:/data" knack-transform
```

### Locally

```bash
python main.py
```

## Cron Schedule

The Docker container runs the transform pipeline every Sunday at 3 AM.
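
In standard five-field cron syntax, "every Sunday at 3 AM" is conventionally written `0 3 * * 0`; the actual crontab entry lives in the Dockerfile and is not reproduced here. To check when the next run would fall, a small sketch like the following can help. The `next_sunday_3am` helper is hypothetical and assumes the schedule is evaluated in the container's local time.

```python
from datetime import datetime, timedelta


def next_sunday_3am(now: datetime) -> datetime:
    """Return the first Sunday 03:00 strictly after `now` (hypothetical helper)."""
    # datetime.weekday(): Monday is 0, Sunday is 6.
    days_ahead = (6 - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=3, minute=0, second=0, microsecond=0
    )
    # If it is already Sunday at or past 03:00, move on to the following week.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```

For example, calling it with a Wednesday timestamp returns 03:00 on the Sunday of the same week.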