{ "cells": [ { "cell_type": "markdown", "id": "83885e86-1ccb-46ec-bee9-a33f3b541569", "metadata": {}, "source": [ "# Summary of the hackathon analyses for the website\n", "\n", "- possibly for display on the website\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "9bd1686f-9bbc-4c05-a5f5-e0c4ce653fb2", "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import altair as alt" ] }, { "cell_type": "markdown", "id": "81780c9a-7721-438b-9726-ff5a70910ce8", "metadata": {}, "source": [ "## Data preparation\n", "\n", "Database dump from 25 March 2023. The individual database tables are read in separately. In addition, all information that belongs directly to a tweet is collected in a single data object. The GIS data for the individual police stations (\"police_stations\") is still incomplete and may need to be checked again.\n", "\n" ] }, { "cell_type": "code", "execution_count": 119, "id": "fcc48831-7999-4d79-b722-736715b1ced6", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "((479991, 3), (151690, 8), (151690, 4), (13327, 5))" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets_meta = pd.concat([pd.read_csv(\"data/entity_old.tsv\", sep = \"\\t\"), # data from old scraper\n", " pd.read_csv(\"data/tweets.csv\")]) # data from new scraper\n", "\n", "tweets_text = pd.concat([pd.read_csv(\"data/tweet_old.tsv\", sep = \"\\t\")[['id', \n", " 'tweet_text', \n", " 'created_at', \n", " 'user_id']].rename(columns = {\"id\":\"tweet_id\"}),\n", " pd.read_csv(\"data/tweets-1679742698645.csv\")])\n", "\n", "tweets_statistics = pd.concat([pd.read_csv(\"data/tweet_old.tsv\", sep = \"\\t\")[['id', \n", " 'like_count', \n", " 'retweet_count', \n", " 'reply_count', \n", " 'quote_count']].rename(columns = {\"id\":\"tweet_id\"}),\n", " pd.read_csv(\"data/tweets-1679742620302.csv\")])\n", 
"\n", "tweets_user = pd.read_csv(\"data/user_old.tsv\", \n", " sep = \"\\t\").rename(columns = {\"id\":\"user_id\",\"name\": \"user_name\"}\n", " ).merge(pd.read_csv(\"data/tweets-1679742702794.csv\"\n", " ).rename(columns = {\"username\":\"handle\", \"handle\": \"user_name\"}),\n", " on = \"user_id\",\n", " how = \"outer\",\n", " suffixes = [\"_2021\", \"_2022\"])\n", "\n", "tweets_meta.shape, tweets_statistics.shape, tweets_text.shape, tweets_user.shape" ] }, { "cell_type": "markdown", "id": "0f7b2b95-0a6c-42c6-a308-5f68d4ba94b9", "metadata": {}, "source": [ "All tweet-related information can now be stored in a single DataFrame:" ] }, { "cell_type": "code", "execution_count": 150, "id": "cf409591-74a0-48dc-8f9e-66f7229f58cd", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "tweet_id int64\n", "like_count int64\n", "retweet_count int64\n", "reply_count int64\n", "quote_count int64\n", "measured_at object\n", "is_deleted float64\n", "tweet_text object\n", "created_at object\n", "user_id int64\n", "user_name_2021 object\n", "handle_2021 object\n", "handle_2022 object\n", "user_name_2022 object\n", "dtype: object" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets_combined = pd.merge(tweets_statistics, \n", " tweets_text,\n", " on = 'tweet_id').merge(tweets_user, on = 'user_id'\n", " ).drop(['id'], axis = 1) # drop unnecessary id column (redundant to index)\n", " \n", "# Convert counts to integer values; missing counts become the sentinel -99\n", "tweets_combined[['like_count', 'retweet_count', 'reply_count', 'quote_count']] = tweets_combined[['like_count', 'retweet_count', 'reply_count', 'quote_count']].fillna(-99).astype(int)\n", "tweets_combined.dtypes" ] }, { "cell_type": "code", "execution_count": 44, "id": "e312a975-3921-44ee-a7c5-37736678bc3f", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "

| | user_id | handle | username |
|---|---|---|---|
| 0 | 1000004686156652545 | 6jannik9 | Systemstratege: |
| 1 | 1000043230870867969 | lsollik | Physiolucy |
| 2 | 1000405847460151296 | achim1949hans | Systemstratege: |
| 3 | 1000460805719121921 | wahrew | WahreWorte |
| 4 | 1000744009638252544 | derd1ck3 | Ⓓ①ⓒⓚ①③ (🏡) |
| ... | ... | ... | ... |
| 11554 | 99931264 | havok1975 | Systemstratege: |
| 11555 | 999542638226403328 | madame_de_saxe | Systemstratege: |
| 11556 | 999901133282754560 | tungstendie74 | Systemstratege: |
| 11557 | 999904275080794112 | _danielheim | Systemstratege: |
| 11558 | 999955376454930432 | amyman6010 | Systemstratege: |

11559 rows × 3 columns

| | handle | count | Name | Typ | Bundesland | Stadt | LAT | LONG |
|---|---|---|---|---|---|---|---|---|
| 11 | polizei_ffm | 2993 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | polizei_nrw_do | 2860 | Polizei NRW DO | Polizei | Nordrhein-Westfalen | Dortmund | 51.5142273 | 7.4652789 |
| 0 | polizeisachsen | 2700 | Polizei Sachsen | Polizei | Sachsen | Dresden | 51.0493286 | 13.7381437 |
| 91 | polizeibb | 2310 | NaN | NaN | NaN | NaN | NaN | NaN |
| 61 | polizeihamburg | 2093 | Polizei Hamburg | Polizei | Hamburg | Hamburg | 53.550341 | 10.000654 |

| | user_id | handle | user_name |
|---|---|---|---|
| 0 | 1000004686156652545 | 6jannik9 | Systemstratege: |
| 1 | 1000043230870867969 | LSollik | Physiolucy |
| 2 | 1000405847460151296 | Achim1949Hans | Systemstratege: |
| 3 | 1000460805719121921 | WahreW | WahreWorte |
| 4 | 1000744009638252544 | derD1ck3 | Ⓓ①ⓒⓚ①③ (🏡) |
| ... | ... | ... | ... |
| 11554 | 99931264 | Havok1975 | Systemstratege: |
| 11555 | 999542638226403328 | Madame_de_Saxe | Systemstratege: |
| 11556 | 999901133282754560 | tungstendie74 | Systemstratege: |
| 11557 | 999904275080794112 | _danielheim | Systemstratege: |
| 11558 | 999955376454930432 | amyman6010 | Systemstratege: |

11559 rows × 3 columns

| | like_count | retweet_count | reply_count | quote_count |
|---|---|---|---|---|
| 0 | 2 | 1 | 2 | 0 |
| 1 | 2 | 0 | 0 | 0 |
| 2 | 19 | 3 | 3 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 2 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 151685 | 5 | 1 | 1 | 0 |
| 151686 | 2 | 0 | 0 | 0 |
| 151687 | 6 | 0 | 0 | 0 |
| 151688 | 2 | 0 | 0 | 0 |
| 151689 | 10 | 1 | 0 | 0 |

151690 rows × 4 columns
\n", "