# Knack Database Visualization

This notebook explores the Knack database (`../data/knack.transformed.sqlite`) and visualizes its contents interactively with Altair and Plotly.

## 1. Import Required Libraries

Import necessary libraries for data manipulation and visualization.

In [1]:
import sqlite3
import pandas as pd
import altair as alt
from pathlib import Path

# Configure Altair
alt.data_transformers.disable_max_rows()
alt.renderers.enable('default')

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Connect to SQLite Database

Connect to the `knack.transformed.sqlite` database and list its tables.

In [2]:
# Connect to the database
db_path = Path('../data/knack.transformed.sqlite')
conn = sqlite3.connect(db_path)
cursor = conn.cursor()

# Get all table names
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()

print("Tables in the database:")
for table in tables:
    print(f"  - {table[0]}")

Tables in the database:
  - posts
  - posttags
  - postcategories
  - tags
  - categories
  - authors
  - post_authors
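
The table listing above can also be wrapped in a small helper. This is a minimal sketch that runs against an in-memory demo database rather than the real file; `list_user_tables` is a hypothetical name, and filtering out SQLite's internal `sqlite_*` tables is an assumption about what is wanted here.

```python
import sqlite3
from contextlib import closing


def list_user_tables(conn):
    """Return user table names, skipping SQLite's internal sqlite_* tables."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Demo on an in-memory database (invented tables, not the real schema).
# AUTOINCREMENT makes SQLite create the internal sqlite_sequence table,
# which the helper filters out.
with closing(sqlite3.connect(":memory:")) as demo:
    demo.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT)")
    demo.execute("CREATE TABLE authors (id INTEGER, name TEXT)")
    table_names = list_user_tables(demo)

print(table_names)
```

`contextlib.closing` ensures the connection is closed when the block ends; the notebook's own `conn` stays open because later cells reuse it.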


## 3. Explore Database Schema

Examine the structure of each table to understand the data.

In [3]:
# Examine schema for each table
for table in tables:
    table_name = table[0]
    print(f"\n{'='*60}")
    print(f"Table: {table_name}")
    print('='*60)
    
    # Get column information
    cursor.execute(f"PRAGMA table_info({table_name})")
    columns = cursor.fetchall()
    
    print("\nColumns:")
    for col in columns:
        print(f"  {col[1]:20} {col[2]:15} {'NOT NULL' if col[3] else ''}")
    
    # Get row count
    cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
    count = cursor.fetchone()[0]
    print(f"\nTotal rows: {count}")


Table: posts

Columns:
  index                INTEGER         
  id                   INTEGER         
  title                TEXT            
  author               TEXT            
  date                 TIMESTAMP       
  category             TEXT            
  url                  TEXT            
  img_link             TEXT            
  tags                 TEXT            
  text                 TEXT            
  html                 TEXT            
  scraped_at           TIMESTAMP       
  is_cleaned           BOOLEAN         
  embedding            BLOB            
  umap_x               REAL            
  umap_y               REAL            

Total rows: 3678

Table: posttags

Columns:
  post_id              INTEGER         
  tag_id               INTEGER         

Total rows: 14272

Table: postcategories

Columns:
  post_id              INTEGER         
  category_id          INTEGER         

Total rows: 3691

Table: tags

Columns:
  id                   INTEGER        
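
The schema cell interpolates table names directly into SQL with f-strings. That works for the names seen here, but a name containing spaces or matching a keyword would break the statement. A minimal sketch of a safer variant follows; `quote_ident` and `describe_table` are hypothetical helpers, and the demo table is invented for illustration.

```python
import sqlite3


def quote_ident(name):
    """Quote an SQLite identifier: wrap in double quotes, double embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def describe_table(conn, table_name):
    """Return ([(column, declared_type, not_null), ...], row_count) for one table."""
    quoted = quote_ident(table_name)
    columns = [
        (row[1], row[2], bool(row[3]))
        for row in conn.execute(f"PRAGMA table_info({quoted})")
    ]
    row_count = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return columns, row_count


# Demo with a table name that needs quoting (invented, not from the real database)
demo = sqlite3.connect(":memory:")
demo.execute('CREATE TABLE "post tags" (post_id INTEGER NOT NULL, tag_id INTEGER)')
demo.executemany('INSERT INTO "post tags" VALUES (?, ?)', [(1, 10), (1, 11), (2, 10)])
columns, row_count = describe_table(demo, "post tags")
demo.close()

print(columns, row_count)
```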

## 4. Load Data from Database

Load the data from tables into pandas DataFrames for analysis and visualization.

In [4]:
# Load all tables into DataFrames
dataframes = {}

for table in tables:
    table_name = table[0]
    query = f"SELECT * FROM {table_name}"
    df = pd.read_sql_query(query, conn)
    dataframes[table_name] = df
    print(f"Loaded {table_name}: {df.shape[0]} rows, {df.shape[1]} columns")

# Display available dataframes
print(f"\nAvailable dataframes: {list(dataframes.keys())}")

Loaded posts: 3678 rows, 16 columns
Loaded posttags: 14272 rows, 2 columns
Loaded postcategories: 3691 rows, 2 columns
Loaded tags: 64 rows, 2 columns
Loaded categories: 6 rows, 2 columns
Loaded authors: 1143 rows, 4 columns
Loaded post_authors: 4934 rows, 2 columns

Available dataframes: ['posts', 'posttags', 'postcategories', 'tags', 'categories', 'authors', 'post_authors']
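
`SELECT *` also pulls the large `html` and `embedding` columns into memory. When only metadata is needed, a narrower query with date parsing may be enough. The sketch below assumes a cut-down `posts` table in an in-memory database; the column subset and the name `posts_light` are illustrative choices, not part of the notebook.

```python
import sqlite3

import pandas as pd

# Invented demo rows that mimic a few columns of the posts table
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE posts (id INTEGER, title TEXT, date TIMESTAMP, html TEXT, embedding BLOB)"
)
demo.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
    [
        (1, "First", "2023-01-05 10:00:00", "<p>a</p>", b"\x00"),
        (2, "Second", None, "<p>b</p>", b"\x01"),
    ],
)

# Select only the columns needed and let pandas parse the date column
posts_light = pd.read_sql_query(
    "SELECT id, title, date FROM posts",
    demo,
    parse_dates=["date"],
)
demo.close()

print(posts_light.dtypes)
```

With `parse_dates`, the `date` column arrives as a datetime dtype instead of `object`, so a later `pd.to_datetime` call is not needed for it.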


## 5. Explore Data Structure

Inspect the first DataFrame (`posts`): its shape, column dtypes, and the number of missing values per column.

In [5]:
# Select the first table to explore (or specify a specific table)
if dataframes:
    first_table = list(dataframes.keys())[0]
    df = dataframes[first_table]
    
    print(f"Exploring: {first_table}")
    print(f"\nShape: {df.shape}")
    print(f"\nData types:\n{df.dtypes}")
    
    print(f"\nMissing values:")
    print(df.isnull().sum())

Exploring: posts

Shape: (3678, 16)

Data types:
index           int64
id              int64
title          object
author         object
date           object
category       object
url            object
img_link       object
tags           object
text           object
html           object
scraped_at     object
is_cleaned      int64
embedding      object
umap_x        float64
umap_y        float64
dtype: object

Missing values:
index           0
id              0
title           0
author          3
date            3
category        3
url             0
img_link      148
tags            4
text            0
html            0
scraped_at      0
is_cleaned      0
embedding       0
umap_x          0
umap_y          0
dtype: int64


## 6. Create Time Series Visualizations

If the data contains temporal information, create time series visualizations.

In [6]:
# Check for date/time columns and create time series visualizations
if dataframes:
    df = dataframes[list(dataframes.keys())[0]]
    
    # Look for columns that might contain dates (check column names)
    date_like_cols = [col for col in df.columns if any(
        keyword in col.lower() for keyword in ['date', 'time', 'created', 'updated', 'timestamp']
    )]
    
    if date_like_cols:
        print(f"Found potential date columns: {date_like_cols}")
        
        # Try to convert the first date-like column to datetime
        date_col = date_like_cols[0]
        try:
            df[date_col] = pd.to_datetime(df[date_col], errors='coerce')
            
            # Create a time series chart - count records per month
            # ('ME' = month end; the older 'M' alias is deprecated since pandas 2.2)
            time_series = df.groupby(pd.Grouper(key=date_col, freq='ME')).size().reset_index(name='count')
            
            chart = alt.Chart(time_series).mark_line(point=True).encode(
                x=alt.X(f'{date_col}:T', title='Date'),
                y=alt.Y('count:Q', title='Count'),
                tooltip=[date_col, 'count']
            ).properties(
                title=f'Records Over Time',
                width=700,
                height=400
            ).interactive()
            
            display(chart)
        except Exception as e:
            print(f"Could not create time series chart: {e}")
    else:
        print("No date/time columns found")

Found potential date columns: ['date']


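
Monthly counts can also be computed without an offset alias by converting timestamps to monthly periods, which sidesteps differences in frequency aliases between pandas versions. This is a minimal sketch on invented dates; the column names `month` and `count` are illustrative.

```python
import pandas as pd

# Invented demo data, including a missing and an unparseable date
posts = pd.DataFrame(
    {"date": ["2023-01-05", "2023-01-20", "2023-03-02", None, "not a date"]}
)
posts["date"] = pd.to_datetime(posts["date"], errors="coerce")

monthly = (
    posts.dropna(subset=["date"])                           # drop rows without a usable date
    .assign(month=lambda d: d["date"].dt.to_period("M"))    # calendar month of each post
    .groupby("month")
    .size()
    .rename("count")
    .reset_index()
)
# Convert periods back to timestamps (month start) so a temporal axis can use them
monthly["month"] = monthly["month"].dt.to_timestamp()

print(monthly)
```

Unlike `pd.Grouper` with a frequency, this approach only returns months that contain at least one post, so gaps appear as missing rows rather than zero counts.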


### Articles per Category

Visualize the distribution of articles across different categories.

In [7]:
dataframes.keys()

dict_keys(['posts', 'posttags', 'postcategories', 'tags', 'categories', 'authors', 'post_authors'])

In [8]:
# Check that category data exists and create a bar chart
if 'postcategories' in dataframes and 'categories' in dataframes:
    df_post_cat = dataframes['postcategories']
    df_categories = dataframes['categories']
    
    # Join postcategories with categories to get category names
    if 'category_id' in df_post_cat.columns and 'id' in df_categories.columns and 'category' in df_categories.columns:
        # Merge the two tables
        df_merged = df_post_cat.merge(
            df_categories[['id', 'category']], 
            left_on='category_id', 
            right_on='id',
            how='left'
        )
        
        # Count articles per category
        category_counts = df_merged['category'].value_counts().reset_index()
        category_counts.columns = ['category', 'article_count']
        
        # Sort by count descending
        category_counts = category_counts.sort_values('article_count', ascending=False)
        
        chart = alt.Chart(category_counts).mark_bar().encode(
            x=alt.X('category:N', sort='-y', title='Category', axis=alt.Axis(labelAngle=-45)),
            y=alt.Y('article_count:Q', title='Number of Articles'),
            color=alt.Color('article_count:Q', scale=alt.Scale(scheme='viridis'), legend=None),
            tooltip=['category', alt.Tooltip('article_count:Q', title='Articles')]
        ).properties(
            title='Distribution of Articles per Category',
            width=700,
            height=450
        ).interactive()
        
        display(chart)
        
        # Show summary statistics
        print(f"\nTotal categories: {len(category_counts)}")
        print(f"Most articles in a category: {category_counts['article_count'].max()}")
        print(f"Average articles per category: {category_counts['article_count'].mean():.2f}")
    else:
        print("Could not find required columns for joining tables")
else:
    print("Need both 'postcategories' and 'categories' tables in database")


Total categories: 6
Most articles in a category: 2098
Average articles per category: 615.17
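
The same per-category counts can be computed in SQL before anything reaches pandas. The sketch below uses an in-memory database with invented rows; only the table and column names (`categories`, `postcategories`, `category_id`, `post_id`) follow the schema shown earlier.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE categories (id INTEGER, category TEXT);
    CREATE TABLE postcategories (post_id INTEGER, category_id INTEGER);
    INSERT INTO categories VALUES (1, 'News'), (2, 'Sport'), (3, 'Empty');
    INSERT INTO postcategories VALUES (10, 1), (11, 1), (12, 2), (12, 1);
    """
)

# LEFT JOIN from categories keeps categories that have no posts at all;
# COUNT(DISTINCT ...) avoids counting a duplicated link row twice.
category_counts = demo.execute(
    """
    SELECT c.category, COUNT(DISTINCT pc.post_id) AS article_count
    FROM categories AS c
    LEFT JOIN postcategories AS pc ON pc.category_id = c.id
    GROUP BY c.id, c.category
    ORDER BY article_count DESC, c.category
    """
).fetchall()
demo.close()

print(category_counts)
```

Starting the join from `categories` differs from the merge in the cell above, which starts from `postcategories` and therefore never lists a category with zero articles.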


### Articles per Tag

Visualize the distribution of articles across different tags.

In [9]:
# Check that tag data exists and create a bar chart
if 'posttags' in dataframes and 'tags' in dataframes:
    df_post_tags = dataframes['posttags']
    df_tags = dataframes['tags']
    
    # Join posttags with tags to get tag names
    if 'tag_id' in df_post_tags.columns and 'id' in df_tags.columns and 'tag' in df_tags.columns:
        # Merge the two tables
        df_merged = df_post_tags.merge(
            df_tags[['id', 'tag']], 
            left_on='tag_id', 
            right_on='id',
            how='left'
        )
        
        # Count articles per tag
        tag_counts = df_merged['tag'].value_counts().reset_index()
        tag_counts.columns = ['tag', 'article_count']
        
        # Show top 30 tags for readability
        tag_counts_top = tag_counts.head(30).sort_values('article_count', ascending=False)
        
        chart = alt.Chart(tag_counts_top).mark_bar().encode(
            x=alt.X('tag:N', sort='-y', title='Tag', axis=alt.Axis(labelAngle=-45)),
            y=alt.Y('article_count:Q', title='Number of Articles'),
            color=alt.Color('article_count:Q', scale=alt.Scale(scheme='oranges'), legend=None),
            tooltip=['tag', alt.Tooltip('article_count:Q', title='Articles')]
        ).properties(
            title='Distribution of Articles per Tag (Top 30)',
            width=700,
            height=450
        ).interactive()
        
        display(chart)
        
        # Show summary statistics
        print(f"\nTotal tags: {len(tag_counts)}")
        print(f"Most articles with a tag: {tag_counts['article_count'].max()}")
        print(f"Average articles per tag: {tag_counts['article_count'].mean():.2f}")
        print(f"Median articles per tag: {tag_counts['article_count'].median():.2f}")
    else:
        print("Could not find required columns for joining tables")
else:
    print("Need both 'posttags' and 'tags' tables in database")


Total tags: 64
Most articles with a tag: 1954
Average articles per tag: 223.00
Median articles per tag: 101.50


### Articles per Author

Visualize the distribution of articles across different authors.

In [10]:
# Check that author data exists and create a bar chart
if 'post_authors' in dataframes and 'authors' in dataframes:
    df_post_authors = dataframes['post_authors']
    df_authors = dataframes['authors']

    # Join post_authors with authors to get author names
    if 'author_id' in df_post_authors.columns and 'id' in df_authors.columns and 'name' in df_authors.columns:
        # Merge the two tables
        df_merged = df_post_authors.merge(
            df_authors[['id', 'name']],
            left_on='author_id',
            right_on='id',
            how='left'
        )

        # Count articles per author
        author_counts = df_merged['name'].value_counts().reset_index()
        author_counts.columns = ['author', 'article_count']

        # Show top 30 authors for readability
        author_counts_top = author_counts.head(30).sort_values('article_count', ascending=False)

        chart = alt.Chart(author_counts_top).mark_bar().encode(
            x=alt.X('author:N', sort='-y', title='Author', axis=alt.Axis(labelAngle=-45)),
            y=alt.Y('article_count:Q', title='Number of Articles'),
            color=alt.Color('article_count:Q', scale=alt.Scale(scheme='oranges'), legend=None),
            tooltip=['author', alt.Tooltip('article_count:Q', title='Articles')]
        ).properties(
            title='Distribution of Articles per Author (Top 30)',
            width=700,
            height=450
        ).interactive()

        display(chart)

        # Show summary statistics
        print(f"\nTotal authors: {len(author_counts)}")
        print(f"Most articles by one author: {author_counts['article_count'].max()}")
        print(f"Average articles per author: {author_counts['article_count'].mean():.2f}")
        print(f"Median articles per author: {author_counts['article_count'].median():.2f}")
    else:
        print("Could not find required columns for joining tables")
else:
    print("Need both 'post_authors' and 'authors' tables in database")


Total authors: 1126
Most articles by one author: 700
Average articles per author: 4.38
Median articles per author: 1.00
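
The category, tag, and author cells repeat the same join-and-count pattern. One way to factor it out is sketched below; `count_posts_per_label` is a hypothetical helper, the demo frames are invented, and it assumes the link table has a `post_id` column and the label table an `id` column, as in the schema above.

```python
import pandas as pd


def count_posts_per_label(link_df, label_df, key, label, top_n=None):
    """Count distinct posts per label by joining a link table to its label table."""
    merged = link_df.merge(
        label_df[["id", label]], left_on=key, right_on="id", how="left"
    )
    counts = (
        merged.groupby(label)["post_id"]
        .nunique()                                    # each post counts once per label
        .sort_values(ascending=False, kind="stable")
        .rename("article_count")
        .reset_index()
    )
    return counts if top_n is None else counts.head(top_n)


# Invented demo data
post_authors = pd.DataFrame({"post_id": [1, 1, 2, 3], "author_id": [1, 2, 1, 1]})
authors = pd.DataFrame({"id": [1, 2], "name": ["A", "B"]})

counts = count_posts_per_label(post_authors, authors, key="author_id", label="name")
print(counts)
```

Note that `groupby` drops rows whose label is missing, so link rows pointing at an unknown `id` are not counted.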


### UMAP Visualization

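Visualize the 2D UMAP coordinates (`umap_x`, `umap_y`) stored in the `posts` table, colored by author. Because the left join with `post_authors` yields one row per post-author pair, a post with several authors is plotted once per author, so the number of points can exceed the number of posts.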

In [11]:
# Check for UMAP coordinates and create scatter plot with author coloring
umap_found = False

# Look for tables with umap_x and umap_y columns
for table_name, df in dataframes.items():
    if 'umap_x' in df.columns and 'umap_y' in df.columns:
        print(f"Found UMAP coordinates in table: {table_name}")
        umap_found = True
        
        # Check if we can join with authors
        if 'posts' in dataframes and 'post_authors' in dataframes and 'authors' in dataframes:
            df_posts = dataframes['posts']
            df_post_authors = dataframes['post_authors']
            df_authors = dataframes['authors']
            
            # Check if the current table has necessary columns for joining
            if 'id' in df.columns or 'post_id' in df.columns:
                post_id_col = 'id' if 'id' in df.columns else 'post_id'
                
                # Start with posts table that has UMAP coordinates
                df_umap = df[[post_id_col, 'umap_x', 'umap_y']].dropna(subset=['umap_x', 'umap_y'])
                
                # Join with post_authors to get author_id
                if 'post_id' in df_post_authors.columns and 'author_id' in df_post_authors.columns:
                    df_umap = df_umap.merge(
                        df_post_authors[['post_id', 'author_id']],
                        left_on=post_id_col,
                        right_on='post_id',
                        how='left'
                    )
                    
                    # Join with authors to get author name
                    if 'id' in df_authors.columns and 'name' in df_authors.columns:
                        df_umap = df_umap.merge(
                            df_authors[['id', 'name']],
                            left_on='author_id',
                            right_on='id',
                            how='left'
                        )
                        
                        # Rename name column to author for clarity
                        df_umap = df_umap.rename(columns={'name': 'author'})
                        
                        # Fill missing authors with 'Unknown'
                        df_umap['author'] = df_umap['author'].fillna('Unknown')
                        
                        # Get top 15 authors by count for better visualization
                        top_authors = df_umap['author'].value_counts().head(15).index.tolist()
                        df_umap['author_group'] = df_umap['author'].apply(
                            lambda x: x if x in top_authors else 'Other'
                        )
                        
                        # Create scatter plot with author coloring
                        scatter = alt.Chart(df_umap).mark_circle(size=40, opacity=0.7).encode(
                            x=alt.X('umap_x:Q', title='UMAP Dimension 1'),
                            y=alt.Y('umap_y:Q', title='UMAP Dimension 2'),
                            color=alt.Color('author_group:N', title='Author', scale=alt.Scale(scheme='tableau20')),
                            tooltip=['author', 'umap_x', 'umap_y']
                        ).properties(
                            title='UMAP 2D Projection by Author',
                            width=800,
                            height=600
                        ).interactive()
                        
                        display(scatter)
                        
                        print(f"\nTotal points: {len(df_umap)}")
                        print(f"Unique authors: {df_umap['author'].nunique()}")
                        print(f"Top 15 authors shown in legend (others grouped as 'Other')")
                    else:
                        print("Could not find required columns in authors table")
                else:
                    print("Could not find required columns in post_authors table")
            else:
                print(f"Could not find post_id column in {table_name} table")
        else:
            # Fallback: create plot without author coloring
            df_umap = df[['umap_x', 'umap_y']].dropna()
            
            scatter = alt.Chart(df_umap).mark_circle(size=30, opacity=0.6).encode(
                x=alt.X('umap_x:Q', title='UMAP Dimension 1'),
                y=alt.Y('umap_y:Q', title='UMAP Dimension 2'),
                tooltip=['umap_x', 'umap_y']
            ).properties(
                title='UMAP 2D Projection',
                width=700,
                height=600
            ).interactive()
            
            display(scatter)
            
            print(f"\nTotal points: {len(df_umap)}")
            print("Note: Author coloring not available (missing required tables)")
        
        break

if not umap_found:
    print("No UMAP coordinates (umap_x, umap_y) found in any table")

Found UMAP coordinates in table: posts



Total points: 5021
Unique authors: 1127
Top 15 authors shown in legend (others grouped as 'Other')
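
The point count above (5021) is larger than the number of posts (3678) because the join produces one row per post-author pair. If each post should appear exactly once, one option is to keep only one author per post before merging. The sketch below uses invented data, and keeping the first listed author is an arbitrary assumption.

```python
import pandas as pd

# Invented demo data following the column names used in the notebook
posts = pd.DataFrame(
    {"id": [1, 2, 3], "umap_x": [0.1, 0.5, 0.9], "umap_y": [1.0, 2.0, 3.0]}
)
post_authors = pd.DataFrame({"post_id": [1, 1, 2], "author_id": [10, 11, 10]})
authors = pd.DataFrame({"id": [10, 11], "name": ["Author A", "Author B"]})

# Keep the first author row per post, then attach the author name
first_author = post_authors.drop_duplicates(subset="post_id", keep="first").merge(
    authors.rename(columns={"id": "author_id", "name": "author"}),
    on="author_id",
    how="left",
)

# validate= makes pandas raise if the join would duplicate posts after all
one_row_per_post = posts.merge(
    first_author[["post_id", "author"]],
    left_on="id",
    right_on="post_id",
    how="left",
    validate="one_to_one",
).drop(columns="post_id")
one_row_per_post["author"] = one_row_per_post["author"].fillna("Unknown")

print(one_row_per_post)
```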


### 3D Embedding Visualization

Visualize the high-dimensional embeddings in 3D space. Note that the cell below plots the first three raw embedding dimensions; it does not apply PCA or any other dimensionality reduction.
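
For a variance-preserving 3D projection, PCA can be sketched with NumPy's SVD. This is a minimal illustration on random demo data; `pca_project` is a hypothetical helper that is not part of the notebook.

```python
import numpy as np


def pca_project(X, n_components=3):
    """Project the rows of X onto their top principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal axes,
    # ordered by decreasing singular value (explained variance)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_components].T


# Random stand-in for an (n_posts, embedding_dim) matrix
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 8))
coords = pca_project(demo)

print(coords.shape)
```

The three output columns could then take the place of `dim_1`, `dim_2`, and `dim_3` in a 3D scatter plot.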


In [16]:
import numpy as np
import plotly.graph_objects as go
import json

# Check if posts table has embedding column
if 'posts' in dataframes:
    df_posts = dataframes['posts']
    
    if 'embedding' in df_posts.columns:
        print("Found embedding column in posts table")
        
        # Extract embeddings and convert to array
        embeddings_3d = []
        valid_indices = []
        
        for idx, embedding in enumerate(df_posts['embedding']):
            try:
                # Handle different embedding formats (string, list, array, bytes)
                if isinstance(embedding, bytes):
                    emb_array = np.array(json.loads(embedding.decode('utf-8')))
                elif isinstance(embedding, str):
                    emb_array = np.array(json.loads(embedding))
                elif isinstance(embedding, (list, tuple)):
                    emb_array = np.array(embedding)
                else:
                    emb_array = embedding
                
                if emb_array is not None and len(emb_array) >= 3:
                    # Take only the first 3 dimensions
                    embeddings_3d.append(emb_array[:3])
                    valid_indices.append(idx)
            except Exception:
                # Skip rows whose embedding cannot be parsed
                continue
        
        if embeddings_3d:
            # Convert to numpy array and ensure it's 2D (n_embeddings, 3)
            embeddings_3d = np.array(embeddings_3d)
            if embeddings_3d.ndim == 1:
                embeddings_3d = embeddings_3d.reshape(-1, 3)
            print(f"Extracted {len(embeddings_3d)} embeddings with shape {embeddings_3d.shape}")
            
            # Create a dataframe with 3D coordinates
            df_3d = pd.DataFrame({
                'dim_1': embeddings_3d[:, 0],
                'dim_2': embeddings_3d[:, 1],
                'dim_3': embeddings_3d[:, 2]
            })
            
            # Try to add author information
            if 'post_authors' in dataframes and 'authors' in dataframes:
                try:
                    df_post_authors = dataframes['post_authors']
                    df_authors = dataframes['authors']
                    
                    # Get author names for valid indices
                    authors = []
                    for idx in valid_indices:
                        post_id = df_posts.iloc[idx]['id'] if 'id' in df_posts.columns else None
                        if post_id is not None:
                            author_rows = df_post_authors[df_post_authors['post_id'] == post_id]
                            if not author_rows.empty:
                                author_id = author_rows.iloc[0]['author_id']
                                author_name = df_authors[df_authors['id'] == author_id]['name'].values
                                authors.append(author_name[0] if len(author_name) > 0 else 'Unknown')
                            else:
                                authors.append('Unknown')
                        else:
                            authors.append('Unknown')
                    
                    df_3d['author'] = authors
                    
                    # Get top 10 authors for coloring
                    top_authors = df_3d['author'].value_counts().head(10).index.tolist()
                    df_3d['author_group'] = df_3d['author'].apply(
                        lambda x: x if x in top_authors else 'Other'
                    )
                    
                    # Create 3D scatter plot with Plotly
                    fig = go.Figure(data=[go.Scatter3d(
                        x=df_3d['dim_1'],
                        y=df_3d['dim_2'],
                        z=df_3d['dim_3'],
                        mode='markers',
                        marker=dict(
                            size=4,
                            color=[top_authors.index(author) if author in top_authors else len(top_authors) 
                                   for author in df_3d['author_group']],
                            colorscale='Viridis',
                            showscale=True,
                            colorbar=dict(title="Author Group"),
                            opacity=0.7
                        ),
                        text=df_3d['author'],
                        hovertemplate='<b>%{text}</b><br>Dim 1: %{x:.3f}<br>Dim 2: %{y:.3f}<br>Dim 3: %{z:.3f}<extra></extra>'
                    )])
                except Exception as e:
                    print(f"Could not add author coloring: {e}")
                    # Fallback: create plot without author coloring
                    fig = go.Figure(data=[go.Scatter3d(
                        x=df_3d['dim_1'],
                        y=df_3d['dim_2'],
                        z=df_3d['dim_3'],
                        mode='markers',
                        marker=dict(size=4, opacity=0.7, color='blue'),
                        hovertemplate='Dim 1: %{x:.3f}<br>Dim 2: %{y:.3f}<br>Dim 3: %{z:.3f}<extra></extra>'
                    )])
            else:
                # Create 3D scatter plot without author coloring
                fig = go.Figure(data=[go.Scatter3d(
                    x=df_3d['dim_1'],
                    y=df_3d['dim_2'],
                    z=df_3d['dim_3'],
                    mode='markers',
                    marker=dict(size=4, opacity=0.7, color='blue'),
                    hovertemplate='Dim 1: %{x:.3f}<br>Dim 2: %{y:.3f}<br>Dim 3: %{z:.3f}<extra></extra>'
                )])
            
            fig.update_layout(
                title='3D Visualization of Post Embeddings (First 3 Dimensions)',
                scene=dict(
                    xaxis_title='Embedding Dimension 1',
                    yaxis_title='Embedding Dimension 2',
                    zaxis_title='Embedding Dimension 3'
                ),
                width=900,
                height=700
            )
            
            fig.show()
        else:
            print("No valid embeddings found")
    else:
        print("No 'embedding' column found in posts table")
else:
    print("No 'posts' table found in database")


Found embedding column in posts table
No valid embeddings found
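
The "No valid embeddings found" output suggests that parsing failed for every row, which would be expected if the `embedding` BLOB holds raw numeric bytes rather than JSON text. The storage format is not documented here, so the following is only a sketch: it tries JSON first and then falls back to interpreting the bytes as `float32`, which is an assumption that should be checked against whatever wrote the database. `decode_embedding` is a hypothetical helper.

```python
import json

import numpy as np


def decode_embedding(blob, dtype=np.float32):
    """Decode one embedding value; returns a 1-D array or None if undecodable.

    Assumption: BLOBs are either JSON text or raw bytes of `dtype`.
    """
    if blob is None:
        return None
    if isinstance(blob, str):
        return np.asarray(json.loads(blob), dtype=float)
    if isinstance(blob, (bytes, bytearray, memoryview)):
        raw = bytes(blob)
        try:
            arr = np.asarray(json.loads(raw.decode("utf-8")), dtype=float)
            if arr.ndim == 1:
                return arr
        except ValueError:
            # Not valid UTF-8 or not valid JSON: fall through to raw bytes
            pass
        if len(raw) % np.dtype(dtype).itemsize != 0:
            return None
        return np.frombuffer(raw, dtype=dtype)
    return np.asarray(blob, dtype=float)


# Round-trip demo with an invented vector
vec = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)
decoded_raw = decode_embedding(vec.tobytes())
decoded_json = decode_embedding(b"[1, 2, 3]")

print(decoded_raw, decoded_json)
```

If the decoded vectors have an unexpected length, the assumed `dtype` is probably wrong (for example `float64` instead of `float32`).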
