{ "cells": [ { "cell_type": "markdown", "id": "83885e86-1ccb-46ec-bee9-a33f3b541569", "metadata": {}, "source": [ "# Summary of the hackathon analyses for the website\n", "\n", "- possibly for display on the website\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "9bd1686f-9bbc-4c05-a5f5-e0c4ce653fb2", "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import altair as alt" ] }, { "cell_type": "markdown", "id": "81780c9a-7721-438b-9726-ff5a70910ce8", "metadata": {}, "source": [ "## Data preparation\n", "\n", "Database dump from 25 March 2023. The individual tables of the database are read in separately. In addition, all information that belongs directly to a tweet is collected into a single data object. The GIS data for the individual police stations (\"police_stations\") is still incomplete and may need to be checked again.\n", "\n" ] }, { "cell_type": "code", "execution_count": 45, "id": "fcc48831-7999-4d79-b722-736715b1ced6", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "((479991, 3), (151690, 8), (151690, 4), (13327, 3), (163, 7))" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Merge the tables from the old (~2021) and new (~2022) scrapers\n", "\n", "## cols: hashtag, url, mention (same for both)\n", "tweets_meta = pd.concat([pd.read_csv(\"data/entity_old.tsv\", sep = \"\\t\"), # data from old scraper\n", " pd.read_csv(\"data/tweets.csv\")]) # data from new scraper\n", "\n", "## cols: id, tweet_text, created_at, user_id; only a subset of the old table's columns (the same TSV is read again in the next step)\n", "tweets_text = pd.concat([pd.read_csv(\"data/tweet_old.tsv\", sep = \"\\t\")[['id','tweet_text', 'created_at', 'user_id']].rename(columns = {\"id\":\"tweet_id\"}),\n", " pd.read_csv(\"data/tweets-1679742698645.csv\")])\n", "\n", "## cols: id, like_count, retweet_count, reply_count, 
quote_count; only subset from old table\n", "tweets_statistics = pd.concat([pd.read_csv(\"data/tweet_old.tsv\", sep = \"\\t\")[['id', 'like_count', 'retweet_count', 'reply_count', 'quote_count']].rename(columns = {\"id\":\"tweet_id\"}),\n", " pd.read_csv(\"data/tweets-1679742620302.csv\")])\n", "\n", "## cols: user_id, handle, user_name; column names do not match between the old and new data. In the new data set, username and handle even seem to be swapped (inverse order)\n", "## Info: Only a small number of user_ids appear in both data sets; where they do, the username has occasionally changed, so the tables cannot simply be merged on it\n", "tweets_user = pd.read_csv(\"data/user_old.tsv\", \n", " sep = \"\\t\").rename(columns = {\"id\":\"user_id\",\"name\": \"user_name\"} # uniform names\n", " ).merge(pd.read_csv(\"data/tweets-1679742702794.csv\" # merge with renamed new data\n", " ).rename(columns = {\"username\":\"handle\", \"handle\": \"user_name\"}), # swap the column names\n", " on = \"user_id\", # user_id as matching column\n", " how = \"outer\", # keep all unique user_ids\n", " suffixes = [\"_2021\", \"_2022\"]) # mark which data set each handle and user_name column came from\n", "\n", "## Some usernames corresponding to one user_id have changed over time. 
For easier handling only the latest username and handle is kept.\n", "tweets_user = tweets_user.assign(handle = tweets_user.apply(lambda row: row['handle_2021'] if pd.isna(row['handle_2022']) else row['handle_2022'], axis=1),\n", " user_name = tweets_user.apply(lambda row: row['user_name_2021'] if pd.isna(row['user_name_2022']) else row['user_name_2022'], axis=1)\n", " ).drop(['handle_2021', 'handle_2022', 'user_name_2021', 'user_name_2022'], axis =1) # no longer needed\n", "\n", "## Additional information about the police stations\n", "## cols: handle, name, typ, bundesland, stadt, lat, long\n", "police_stations = pd.read_csv(\"data/polizei_accounts_geo.csv\", sep = \"\\t\" \n", " ).rename(columns = {\"Polizei Account\": \"handle\"})\n", "\n", "tweets_meta.shape, tweets_statistics.shape, tweets_text.shape, tweets_user.shape, police_stations.shape" ] }, { "cell_type": "markdown", "id": "0f7b2b95-0a6c-42c6-a308-5f68d4ba94b9", "metadata": {}, "source": [ "Now all tweet-related information can be stored in a single data frame:" ] }, { "cell_type": "code", "execution_count": 24, "id": "f30c2799-02c6-4e6a-ae36-9e039545b6b3", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Merge statistics, tweet text and user information into one data frame\n", "tweets_combined = pd.merge(tweets_statistics, \n", " tweets_text,\n", " on = 'tweet_id').merge(tweets_user, on = 'user_id'\n", " ).drop(['id'], axis = 1) # drop unnecessary id column (redundant to index)\n", " " ] }, { "cell_type": "code", "execution_count": 49, "id": "bd407aba-eec1-41ed-bff9-4c5fcdf6cb9d", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n",
"| | tweet_id | like_count | retweet_count | reply_count | quote_count | measured_at | is_deleted | tweet_text | created_at | user_id | handle | user_name |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 1321021123463663616 | 2 | 1 | 2 | 0 | NaT | <NA> | @mahanna196 Da die Stadt keine Ausnahme für Radfahrer aufgeführt hat, gilt diese (Stand jetzt) a... | 2020-10-27 09:29:13 | 778895426007203840 | polizei_ol | Polizei Oldenburg-Stadt/Ammerland |\n",
"| 1 | 1321037834246066181 | 2 | 0 | 0 | 0 | NaT | <NA> | @mahanna196 Ja. *sr | 2020-10-27 10:35:38 | 778895426007203840 | polizei_ol | Polizei Oldenburg-Stadt/Ammerland |\n",
"| 2 | 1321068234955776000 | 19 | 3 | 3 | 0 | NaT | <NA> | #Aktuell Auf dem ehem. Bundeswehrkrankenhausgelände in #Rostrup wurde ein Sprengsatz gefunden. F... | 2020-10-27 12:36:26 | 778895426007203840 | polizei_ol | Polizei Oldenburg-Stadt/Ammerland |\n",
"| 3 | 1321073940199100416 | 0 | 0 | 0 | 0 | NaT | <NA> | @Emma36166433 Bitte lesen Sie unseren Tweet 2/2 *sr | 2020-10-27 12:59:06 | 778895426007203840 | polizei_ol | Polizei Oldenburg-Stadt/Ammerland |\n",
"| 4 | 1321088646506754049 | 2 | 0 | 0 | 0 | NaT | <NA> | In der vergangenen Woche wurde die Wohnung des Tatverdächtigen durchsucht. Dabei stellten die Be... | 2020-10-27 13:57:32 | 778895426007203840 | polizei_ol | Polizei Oldenburg-Stadt/Ammerland |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 151685 | 1625828803804004354 | 5 | 1 | 1 | 0 | 2023-02-19 13:40:36 | False | #Sicherheit durch #Sichtbarkeit\\nUnsere #Dir3 hat zu diesem Thema wieder einmal die Puppen tanze... | 2023-02-15 12:06:07 | 1168873095614160896 | polizeiberlin_p | Polizei Berlin Prävention |\n",
"| 151686 | 1628004105623900167 | 2 | 0 | 0 | 0 | 2023-02-25 13:14:49 | False | Unser Präventionsteam vom #A44 berät heute und morgen tagsüber zum Thema Alkohol & Drogen + ... | 2023-02-21 12:10:00 | 1168873095614160896 | polizeiberlin_p | Polizei Berlin Prävention |\n",
"| 151687 | 1628004810183016448 | 6 | 0 | 0 | 0 | 2023-02-25 13:14:49 | False | Auch unser #A52 war heute aktiv und hat zum Thema Alkohol & Drogen im Straßenverkehr beraten... | 2023-02-21 12:12:48 | 1168873095614160896 | polizeiberlin_p | Polizei Berlin Prävention |\n",
"| 151688 | 1628352896352878593 | 2 | 0 | 0 | 0 | 2023-02-26 13:15:05 | False | Gestern führte unser #A13 in einer Wohnsiedlung einen Präventionseinsatz zum Thema „Wohnraumeinb... | 2023-02-22 11:15:58 | 1168873095614160896 | polizeiberlin_p | Polizei Berlin Prävention |\n",
"| 151689 | 1628709531998998529 | 10 | 1 | 0 | 0 | 2023-02-27 12:17:33 | False | Auf dem Gelände der @BUFAStudios (Oberlandstr. 26-35) findet heute die #Seniorenmesse vom Bezirk... | 2023-02-23 10:53:07 | 1168873095614160896 | polizeiberlin_p | Polizei Berlin Prävention |\n",
"\n",
"151690 rows × 12 columns
\n",
"\n",
"| | handle | count | Name | Typ | Bundesland | Stadt | LAT | LONG |\n",
"|---|---|---|---|---|---|---|---|---|\n",
"| 11 | polizei_ffm | 5512 | NaN | NaN | NaN | NaN | NaN | NaN |\n",
"| 0 | polizeisachsen | 5340 | Polizei Sachsen | Polizei | Sachsen | Dresden | 51.0493286 | 13.7381437 |\n",
"| 3 | polizei_nrw_do | 4895 | Polizei NRW DO | Polizei | Nordrhein-Westfalen | Dortmund | 51.5142273 | 7.4652789 |\n",
"| 92 | polizeibb | 4323 | NaN | NaN | NaN | NaN | NaN | NaN |\n",
"| 61 | polizeihamburg | 4042 | Polizei Hamburg | Polizei | Hamburg | Hamburg | 53.550341 | 10.000654 |
\n",
"| | index | tweet_id | like_count | retweet_count | reply_count | quote_count | measured_at | is_deleted | tweet_text | created_at | user_id | handle | user_name | Name | Typ | Bundesland | Stadt | LAT | LONG |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 3053 | 1609539240458878979 | 21455 | 1845 | 3643 | 341 | 2023-01-05 14:44:34 | False | Die Gewalt, die unsere Kolleginnen & Kollegen in der Silvesternacht erleben mussten, ist une... | 2023-01-01 13:17:13 | 2397974054 | polizeiberlin | Polizei Berlin | NaN | NaN | NaN | NaN | NaN | NaN |\n",
"| 1 | 1331 | 1355179228396879872 | 19186 | 3386 | 1203 | 628 | NaT | NaN | An diejenigen, die vergangene Nacht in eine Schule in #Gesundbrunnen eingebrochen sind und 242 T... | 2021-01-29 15:41:20 | 2397974054 | polizeiberlin | Polizei Berlin | NaN | NaN | NaN | NaN | NaN | NaN |\n",
"| 2 | 91693 | 1505620459148173316 | 15708 | 7098 | 186 | 540 | 2022-03-24 20:15:08 | False | WICHTIGE Info:\\nÜber das Internet wird derzeit ein Video verbreitet, in dem von einem Überfall a... | 2022-03-20 19:01:05 | 2389161066 | polizei_nrw_bn | Polizei NRW BN | Polizei NRW BN | Polizei | Nordrhein-Westfalen | Bonn | 50.735851 | 7.10066 |\n",
"| 3 | 91695 | 1505620666476896259 | 10337 | 1539 | 59 | 35 | 2022-03-24 20:15:08 | False | Die Experten gehen derzeit davon aus, dass es sich um ein absichtliches \"Fake-Video\" handelt, da... | 2022-03-20 19:01:54 | 2389161066 | polizei_nrw_bn | Polizei NRW BN | Polizei NRW BN | Polizei | Nordrhein-Westfalen | Bonn | 50.735851 | 7.10066 |\n",
"| 4 | 122631 | 1359098196434292739 | 9471 | 642 | 128 | 102 | NaT | NaN | Weil wir dich schieben! @BVG_Kampagne 😉 https://t.co/N8kdlCxhz2 | 2021-02-09 11:13:55 | 4876039738 | bpol_b | Bundespolizei Berlin | NaN | NaN | NaN | NaN | NaN | NaN |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 151685 | 7569 | 1332625325654757377 | -99 | -99 | -99 | -99 | NaT | NaN | Sinken die Temperaturen ❄, steigt zeitgleich das Risiko für Verkehrsteilnehmer. Höchste Zeit zu ... | 2020-11-28 10:00:11 | 223758384 | polizeisachsen | Polizei Sachsen | Polizei Sachsen | Polizei | Sachsen | Dresden | 51.0493286 | 13.7381437 |\n",
"| 151686 | 7572 | 1332738525507186692 | -99 | -99 | -99 | -99 | NaT | NaN | 📺Am Sonntag, um 19:50 Uhr, geht es bei #KripoLive im \\n@mdrde\\n auch um die Fahndung nach einem ... | 2020-11-28 17:30:00 | 223758384 | polizeisachsen | Polizei Sachsen | Polizei Sachsen | Polizei | Sachsen | Dresden | 51.0493286 | 13.7381437 |\n",
"| 151687 | 144702 | 1465679768494526467 | -99 | -99 | -99 | -99 | NaT | NaN | Musik verbindet!\\nUnser #Adventskalender der #Bundespolizei startet morgen ➡ https://t.co/V6CaTV... | 2021-11-30 13:51:02 | 4876085224 | bpol_nord | Bundespolizei Nord | NaN | NaN | NaN | NaN | NaN | NaN |\n",
"| 151688 | 144701 | 1464124290605977600 | -99 | -99 | -99 | -99 | NaT | NaN | @gretchen_hann Hallo, diese Frage kann die Bundespolizei Spezialkräfte besser beantworten. Richt... | 2021-11-26 06:50:07 | 4876085224 | bpol_nord | Bundespolizei Nord | NaN | NaN | NaN | NaN | NaN | NaN |\n",
"| 151689 | 66854 | 1376453040283209728 | -99 | -99 | -99 | -99 | NaT | NaN | #Bönen #Holzwickede - Verstöße gegen Coronaschutzverordnung: Polizei löst Gaststättenabend und F... | 2021-03-29 08:35:52 | 2389263558 | polizei_nrw_un | Polizei NRW UN | Polizei NRW UN | Polizei | Nordrhein-Westfalen | Unna | 51.5348835 | 7.689014 |\n",
"\n",
"151690 rows × 19 columns
\n",
"\n",
"| | user_id | handle | user_name |\n",
"|---|---|---|---|\n",
"| 0 | 1000004686156652545 | 6jannik9 | Systemstratege: |\n",
"| 1 | 1000043230870867969 | LSollik | Physiolucy |\n",
"| 2 | 1000405847460151296 | Achim1949Hans | Systemstratege: |\n",
"| 3 | 1000460805719121921 | WahreW | WahreWorte |\n",
"| 4 | 1000744009638252544 | derD1ck3 | Ⓓ①ⓒⓚ①③ (🏡) |\n",
"| ... | ... | ... | ... |\n",
"| 11554 | 99931264 | Havok1975 | Systemstratege: |\n",
"| 11555 | 999542638226403328 | Madame_de_Saxe | Systemstratege: |\n",
"| 11556 | 999901133282754560 | tungstendie74 | Systemstratege: |\n",
"| 11557 | 999904275080794112 | _danielheim | Systemstratege: |\n",
"| 11558 | 999955376454930432 | amyman6010 | Systemstratege: |\n",
"\n",
"11559 rows × 3 columns
\n", "