{ "cells": [ { "cell_type": "markdown", "id": "83885e86-1ccb-46ec-bee9-a33f3b541569", "metadata": {}, "source": [ "# Summary of the Hackathon Analyses for the Website\n", "\n", "- possibly for presentation on the website\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "9bd1686f-9bbc-4c05-a5f5-e0c4ce653fb2", "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import altair as alt" ] }, { "cell_type": "markdown", "id": "81780c9a-7721-438b-9726-ff5a70910ce8", "metadata": {}, "source": [ "## Data preparation\n", "\n", "Database dump from 25 March 2023. The individual tables of the database are read in separately. In addition, all information that belongs directly to a tweet is collected in a single data object. The GIS data for the individual police stations (\"police_stations\") is still incomplete and may need to be checked again.\n", "\n" ] }, { "cell_type": "code", "execution_count": 117, "id": "fcc48831-7999-4d79-b722-736715b1ced6", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "((479991, 3), (151690, 8), (151690, 4), (13327, 3))" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweets_meta = pd.concat([pd.read_csv(\"data/entity_old.tsv\", sep = \"\\t\"), # data from old scraper\n", " pd.read_csv(\"data/tweets.csv\")]) # data from new scraper\n", "\n", "tweets_text = pd.concat([pd.read_csv(\"data/tweet_old.tsv\", sep = \"\\t\")[['id', \n", " 'tweet_text', \n", " 'created_at', \n", " 'user_id']].rename(columns = {\"id\":\"tweet_id\"}),\n", " pd.read_csv(\"data/tweets-1679742698645.csv\")])\n", "\n", "tweets_statistics = pd.concat([pd.read_csv(\"data/tweet_old.tsv\", sep = \"\\t\")[['id', \n", " 'like_count', \n", " 'retweet_count', \n", " 'reply_count', \n", " 'quote_count']].rename(columns = {\"id\":\"tweet_id\"}),\n", " pd.read_csv(\"data/tweets-1679742620302.csv\")])\n", 
"\n", "tweets_user = pd.read_csv(\"data/user_old.tsv\", \n", " sep = \"\\t\").rename(columns = {\"id\":\"user_id\",\"name\": \"user_name\"}\n", " ).merge(pd.read_csv(\"data/tweets-1679742702794.csv\"\n", " ).rename(columns = {\"username\":\"handle\", \"handle\": \"user_name\"}),\n", " on = \"user_id\",\n", " how = \"outer\",\n", " suffixes = [\"_2021\", \"_2022\"])\n", "\n", "# Some usernames belonging to one user_id have changed over time. For easier handling, only the latest username and handle are kept\n", "tweets_user = tweets_user.assign(handle = tweets_user.apply(lambda row: row['handle_2021'] if pd.isna(row['handle_2022']) else row['handle_2022'], axis=1),\n", " user_name = tweets_user.apply(lambda row: row['user_name_2021'] if pd.isna(row['user_name_2022']) else row['user_name_2022'], axis=1)\n", " ).drop(['handle_2021', 'handle_2022', 'user_name_2021', 'user_name_2022'], axis =1)\n", "\n", "police_stations = pd.read_csv(\"data/polizei_accounts_geo.csv\", sep = \"\\t\" # additional information on police stations\n", " ).rename(columns = {\"Polizei Account\": \"handle\"})\n", "\n", "tweets_meta.shape, tweets_statistics.shape, tweets_text.shape, tweets_user.shape" ] }, { "cell_type": "markdown", "id": "0f7b2b95-0a6c-42c6-a308-5f68d4ba94b9", "metadata": {}, "source": [ "Now all tweet-related information can be stored in a single data frame:" ] }, { "cell_type": "code", "execution_count": 118, "id": "f30c2799-02c6-4e6a-ae36-9e039545b6b3", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Merge like statistics, tweet text and user information in one data frame\n", "tweets_combined = pd.merge(tweets_statistics, \n", " tweets_text,\n", " on = 'tweet_id').merge(tweets_user, on = 'user_id'\n", " ).drop(['id'], axis = 1) # drop unnecessary id column (redundant with the index)\n", " " ] }, { "cell_type": "code", "execution_count": 119, "id": "bd407aba-eec1-41ed-bff9-4c5fcdf6cb9d", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", 
"output_type": "stream", "text": [ "/nix/store/4105l1v2llsjz4j7qaqsz0fljc9z0z2r-python3-3.10.9-env/lib/python3.10/site-packages/IPython/lib/pretty.py:778: FutureWarning: In a future version, object-dtype columns with all-bool values will not be included in reductions with bool_only=True. Explicitly cast to bool dtype instead.\n", " output = repr(obj)\n", "/nix/store/4105l1v2llsjz4j7qaqsz0fljc9z0z2r-python3-3.10.9-env/lib/python3.10/site-packages/IPython/core/formatters.py:342: FutureWarning: In a future version, object-dtype columns with all-bool values will not be included in reductions with bool_only=True. Explicitly cast to bool dtype instead.\n", " return method()\n" ] }, { "data": { "text/html": [ "
| | tweet_id | like_count | retweet_count | reply_count | quote_count | measured_at | is_deleted | tweet_text | created_at | user_id | handle | user_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1321021123463663616 | 2 | 1 | 2 | 0 | NaT | NaN | @mahanna196 Da die Stadt keine Ausnahme für Ra... | 2020-10-27 09:29:13 | 778895426007203840 | polizei_ol | Polizei Oldenburg-Stadt/Ammerland |
| 1 | 1321037834246066181 | 2 | 0 | 0 | 0 | NaT | NaN | @mahanna196 Ja. *sr | 2020-10-27 10:35:38 | 778895426007203840 | polizei_ol | Polizei Oldenburg-Stadt/Ammerland |
| 2 | 1321068234955776000 | 19 | 3 | 3 | 0 | NaT | NaN | #Aktuell Auf dem ehem. Bundeswehrkrankenhausge... | 2020-10-27 12:36:26 | 778895426007203840 | polizei_ol | Polizei Oldenburg-Stadt/Ammerland |
| 3 | 1321073940199100416 | 0 | 0 | 0 | 0 | NaT | NaN | @Emma36166433 Bitte lesen Sie unseren Tweet 2/... | 2020-10-27 12:59:06 | 778895426007203840 | polizei_ol | Polizei Oldenburg-Stadt/Ammerland |
| 4 | 1321088646506754049 | 2 | 0 | 0 | 0 | NaT | NaN | In der vergangenen Woche wurde die Wohnung des... | 2020-10-27 13:57:32 | 778895426007203840 | polizei_ol | Polizei Oldenburg-Stadt/Ammerland |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 151685 | 1625828803804004354 | 5 | 1 | 1 | 0 | 2023-02-19 13:40:36 | False | #Sicherheit durch #Sichtbarkeit\\nUnsere #Dir3 ... | 2023-02-15 12:06:07 | 1168873095614160896 | polizeiberlin_p | Polizei Berlin Prävention |
| 151686 | 1628004105623900167 | 2 | 0 | 0 | 0 | 2023-02-25 13:14:49 | False | Unser Präventionsteam vom #A44 berät heute und... | 2023-02-21 12:10:00 | 1168873095614160896 | polizeiberlin_p | Polizei Berlin Prävention |
| 151687 | 1628004810183016448 | 6 | 0 | 0 | 0 | 2023-02-25 13:14:49 | False | Auch unser #A52 war heute aktiv und hat zum Th... | 2023-02-21 12:12:48 | 1168873095614160896 | polizeiberlin_p | Polizei Berlin Prävention |
| 151688 | 1628352896352878593 | 2 | 0 | 0 | 0 | 2023-02-26 13:15:05 | False | Gestern führte unser #A13 in einer Wohnsiedlun... | 2023-02-22 11:15:58 | 1168873095614160896 | polizeiberlin_p | Polizei Berlin Prävention |
| 151689 | 1628709531998998529 | 10 | 1 | 0 | 0 | 2023-02-27 12:17:33 | False | Auf dem Gelände der @BUFAStudios (Oberlandstr.... | 2023-02-23 10:53:07 | 1168873095614160896 | polizeiberlin_p | Polizei Berlin Prävention |

151690 rows × 12 columns

| | handle | count | Name | Typ | Bundesland | Stadt | LAT | LONG |
|---|---|---|---|---|---|---|---|---|
| 11 | polizei_ffm | 5512 | NaN | NaN | NaN | NaN | NaN | NaN |
| 0 | polizeisachsen | 5340 | Polizei Sachsen | Polizei | Sachsen | Dresden | 51.0493286 | 13.7381437 |
| 3 | polizei_nrw_do | 4895 | Polizei NRW DO | Polizei | Nordrhein-Westfalen | Dortmund | 51.5142273 | 7.4652789 |
| 92 | polizeibb | 4323 | NaN | NaN | NaN | NaN | NaN | NaN |
| 61 | polizeihamburg | 4042 | Polizei Hamburg | Polizei | Hamburg | Hamburg | 53.550341 | 10.000654 |

| | user_id | handle | user_name |
|---|---|---|---|
| 0 | 1000004686156652545 | 6jannik9 | Systemstratege: |
| 1 | 1000043230870867969 | LSollik | Physiolucy |
| 2 | 1000405847460151296 | Achim1949Hans | Systemstratege: |
| 3 | 1000460805719121921 | WahreW | WahreWorte |
| 4 | 1000744009638252544 | derD1ck3 | Ⓓ①ⓒⓚ①③ (🏡) |
| ... | ... | ... | ... |
| 11554 | 99931264 | Havok1975 | Systemstratege: |
| 11555 | 999542638226403328 | Madame_de_Saxe | Systemstratege: |
| 11556 | 999901133282754560 | tungstendie74 | Systemstratege: |
| 11557 | 999904275080794112 | _danielheim | Systemstratege: |
| 11558 | 999955376454930432 | amyman6010 | Systemstratege: |

11559 rows × 3 columns

| user_id | user_name | count |
|---|---|---|
| 223758384 | Polizei Sachsen | 5340 |
| 259607457 | Polizei NRW K | 2544 |
| 424895827 | Polizei Stuttgart | 1913 |
| 769128278 | Polizei NRW DO | 4895 |
| 775664780 | Polizei Rostock | 604 |
| ... | ... | ... |
| 1169206134189830145 | Polizei Stendal | 842 |
| 1184022676488314880 | Polizei Pforzheim | 283 |
| 1184024283342950401 | Polizei Ravensburg | 460 |
| 1232548941889228808 | Systemstratege: | 168 |
| 1295978598034284546 | Polizei ZPD NI | 133 |

163 rows × 1 columns
\n", "