{"id":2093,"date":"2020-10-16T18:34:12","date_gmt":"2020-10-16T23:34:12","guid":{"rendered":"https:\/\/mathewkiangcom.local\/?p=2093"},"modified":"2020-11-26T19:00:27","modified_gmt":"2020-11-27T00:00:27","slug":"code-for-deleting-old-tweets","status":"publish","type":"post","link":"https:\/\/mathewkiangcom.local\/2020\/10\/16\/code-for-deleting-old-tweets\/","title":{"rendered":"Applying an intro-level networks concept to deleting tweets"},"content":{"rendered":"T<\/span>here are a few services<\/a> out there that will delete your old tweets for you, but I wanted to delete tweets with a bit more control. For example, there are some tweets I need to keep up for whatever reason (e.g., I need it for verification<\/a>) or a few jokes I’m proud of and don’t want to delete.<\/p>\n

If you just want the R<\/code> code to delete some tweets based on age and likes, here it is<\/a> (noting that it is based on Chris Albon’s Python script<\/a>). In this post, I go over a bit of code about what I thought was an interesting problem: given a list of tweets, how can we identify and group threads?<\/p>\n

Below, I plot all my tweets over time (x-axis) by the number of “likes” (y-axis) and I highlight in red tweets that are threaded together. Ignore the boxes for now.<\/p>\n

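[Figure: my tweets over time (x-axis) by number of likes (y-axis), with threaded tweets highlighted in red and grey boxes marking the deletion rules.]<\/a><\/p>\n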

Pulling the data using rtweet<\/code><\/a>, you end up with a data frame that looks something like this (only with many, many more rows and columns):<\/p>\n

> before_df %>% select(status_id, created_at, screen_name, text, reply_to_status_id)\r\n# A tibble: 510 x 5\r\n   status_id    created_at          screen_name text                           reply_to_status\u2026\r\n   <chr>        <dttm>              <chr>       <chr>                          <chr>           \r\n 1 13167994911\u2026 2020-10-15 17:53:58 mathewkiang \"@JosephPalamar Haha \u2014 the on\u2026 131679883199713\u2026\r\n 2 13167817880\u2026 2020-10-15 16:43:37 mathewkiang \"@khayeswilson Ah \\\"self-liki\u2026 131678029839121\u2026\r\n 3 13167812579\u2026 2020-10-15 16:41:31 mathewkiang \"@khayeswilson AH! This is pe\u2026 131678029839121\u2026\r\n 4 13167755278\u2026 2020-10-15 16:18:44 mathewkiang \"I've been coding up a script\u2026 NA              \r\n 5 13165350914\u2026 2020-10-15 00:23:20 mathewkiang \"https:\/\/t.co\/7PtlUKWTeU http\u2026 NA              \r\n 6 13161332751\u2026 2020-10-13 21:46:40 mathewkiang \"Data: Full-time academic job\u2026 NA              \r\n 7 13144052234\u2026 2020-10-09 03:20:00 mathewkiang \"@simonw Thanks for the info!\u2026 131439639207741\u2026\r\n 8 13143912914\u2026 2020-10-09 02:24:38 mathewkiang \"@simonw Does this include da\u2026 131439055526721\u2026\r\n 9 13142896495\u2026 2020-10-08 19:40:45 mathewkiang \"Me: This paper has been out \u2026 NA              \r\n10 13136475049\u2026 2020-10-07 01:09:06 mathewkiang \"@Doc_Courtney If by \u201cpassing\u2026 131364337679803\u2026\r\n# \u2026 with 500 more rows<\/pre>\n

To select the tweets you want to delete, it is straightforward to make a rule like: (1) delete all tweets created more than two years ago with fewer than 100 likes (left-most grey box in the plot) or (2) delete all tweets created more than 90 days ago with fewer than 25 likes (bottom grey box in the plot). You could even write a function in which the number of likes required to survive grows exponentially with a tweet’s age. And obviously, you can create a list of tweets (status_id<\/code>s) that you want to keep.<\/p>\n
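As a rough illustration of the two rules and the keep list, here is a minimal base R sketch on made-up data. The toy data frame, the keep_ids vector, and the favorite_count column name are assumptions for the example, not the code from the gist.

```r
# Toy stand-in for before_df. The column names (status_id, created_at,
# favorite_count) are assumptions about what the pulled data looks like.
today <- as.Date('2020-10-16')
tweets <- data.frame(
  status_id      = c('a', 'b', 'c', 'd'),
  created_at     = as.Date(c('2017-05-01', '2017-06-01',
                             '2020-05-01', '2020-10-01')),
  favorite_count = c(12, 250, 3, 0),
  stringsAsFactors = FALSE
)
keep_ids <- c('c')  # hypothetical list of tweets to always keep

# Age of each tweet in days
age_days <- as.numeric(today - tweets$created_at)

# Rule 1: older than two years with fewer than 100 likes
rule_old <- age_days > 2 * 365 & tweets$favorite_count < 100
# Rule 2: older than 90 days with fewer than 25 likes
rule_stale <- age_days > 90 & tweets$favorite_count < 25

# A tweet is deleted if either rule fires and it is not on the keep list
tweets$delete <- (rule_old | rule_stale) & !(tweets$status_id %in% keep_ids)
to_delete <- tweets$status_id[tweets$delete]
```

Here only tweet a ends up in to_delete: b has too many likes, c is protected by the keep list, and d is too recent.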

However, this assumes all tweets are independent. Things get a bit more complicated if you want to judge a whole set of tweets by the score of a single tweet in the set. If, for example, you string together a Twitter thread, you may want to delete or save the entire thread based only on the first tweet, since deleting the “unliked” tweets will break up the thread. Twitter doesn’t provide a column that links threads together through a unified ID.<\/p>\n

After chatting with Malcolm Barrett<\/a> about it for a bit, I realized this is a fairly simple network problem. If you imagine the data frame above, where every row is a tweet, as an edge list between the vertices status_id<\/code> and reply_to_status_id<\/code>, then you can remove all isolates to get trees of threads (most would just be chains). The key code is here<\/a>, but to sketch out the broad points:<\/p>\n

    \n
  1. Take before_df<\/code> and (a) filter out isolates (i.e., non-threaded tweets) by making sure each tweet is referred to by another tweet or refers to another tweet within the data frame and (b) remove replies to other people by dropping tweets that start with “@”. Because it’s an edge list, we will rename the columns to “from” and “to” and, if there is no terminating vertex (i.e., it’s the first tweet in a thread), we will create a self-loop.\n
    thread_df <- before_df %>%\r\n        filter(status_id %in% reply_to_status_id | \r\n                   reply_to_status_id %in% status_id,\r\n               substr(text, 1, 1) != \"@\") %>% \r\n        select(from = status_id, to = reply_to_status_id) %>%\r\n        mutate(to = ifelse(is.na(to), from, to))<\/pre>\n<\/li>\n
  2. Now just convert this edge list into a graph and extract all the (weakly) connected components using igraph<\/code>:\n
    thread_assignments <- thread_df %>%\r\n        graph_from_data_frame(directed = TRUE) %>%\r\n        components()<\/pre>\n<\/li>\n
  3. Now you have a mapping of every threaded tweet ID to a component ID. Below, I just take this mapping and then create a new component ID that is the same as the starting tweet of the thread.\n
    id_mapping <- thread_df %>%\r\n        select(status_id = from) %>%\r\n        left_join(tibble(\r\n            status_id = names(thread_assignments$membership),\r\n            membership = thread_assignments$membership\r\n        )) %>%\r\n        group_by(membership) %>%\r\n        mutate(new_status_id = min(status_id)) %>%\r\n        ungroup()<\/pre>\n<\/li>\n<\/ol>\n
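To see what the steps above produce, here is a minimal sketch on a made-up edge list (the IDs are hypothetical, not real tweets). It assumes igraph is installed, and it finds each thread's first tweet through the self-loop created in step 1, which is a variation on the min(status_id) relabeling above rather than the exact code in the gist.

```r
library(igraph)

# Hypothetical edge list: two threads, 101 <- 102 <- 103 and 201 <- 202.
# The first tweet of each thread points to itself (a self-loop).
edges <- data.frame(
  from = c('101', '102', '103', '201', '202'),
  to   = c('101', '101', '102', '201', '201'),
  stringsAsFactors = FALSE
)

# Build the graph and extract its (weakly) connected components
g <- graph_from_data_frame(edges, directed = TRUE)
comp <- components(g)

# comp$membership is a vector of component numbers named by tweet ID
membership <- comp$membership[edges$from]

# The root of each thread is the tweet with a self-loop
roots <- edges$from[edges$from == edges$to]
root_of_component <- setNames(roots, comp$membership[roots])

# Relabel every tweet with the ID of the first tweet in its thread
thread_id <- unname(root_of_component[as.character(membership)])
```

Because status_id is a character column, min() compares IDs as strings rather than as numbers, so keying on the self-loop is one way to avoid surprises if a thread ever mixes IDs of different lengths.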

    That’s it! With this mapping, you can left_join()<\/code> it to the original data frame and perform manipulations on the thread as a group of tweets rather than on each tweet individually. Anyway, check out the gist<\/code><\/a> to see how I deleted the tweets and implemented this. I just thought it was a nice, clean application of an introductory-level network concept to an applied data cleaning problem.<\/p>\n
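As a sketch of what a thread-level manipulation could look like once the mapping has been joined on, the base R snippet below scores every tweet by the likes of the first tweet in its thread. The thread_id and favorite_count column names, the toy data, and the 25-like cutoff are assumptions for illustration only.

```r
# Toy data: one three-tweet thread started by tweet 101, plus one
# stand-alone tweet (300) that has no thread assignment.
tweets <- data.frame(
  status_id      = c('101', '102', '103', '300'),
  thread_id      = c('101', '101', '101', NA),
  favorite_count = c(150, 2, 0, 1),
  stringsAsFactors = FALSE
)

# Non-threaded tweets form their own group of one
tweets$group_id <- ifelse(is.na(tweets$thread_id),
                          tweets$status_id, tweets$thread_id)

# Look up the likes of the first tweet in each group
first_row <- match(tweets$group_id, tweets$status_id)
tweets$group_likes <- tweets$favorite_count[first_row]

# Keep or delete the whole group based on its first tweet
tweets$delete <- tweets$group_likes < 25
```

The unliked replies 102 and 103 survive because the thread's first tweet clears the cutoff, while the stand-alone tweet 300 is flagged for deletion.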

    After deleting old and boring tweets and keeping tweets I liked (taking into account groups), I’m left with the black points above. The grey points were tweets that I ended up deleting.<\/p>\n

    (Disclaimer: There’s almost certainly a better way to do this \u2014 I just don’t know it.)<\/p>\n","protected":false},"excerpt":{"rendered":"

    There are a few services out there that will delete your old tweets for you, but I wanted to delete tweets with a bit more control. For example, there are some tweets I need to keep up for whatever reason (e.g., I need it for verification) or a few jokes I’m proud of and don’t want to delete. If you just want the R code to delete some tweets based on age and likes, here it is (noting that it is based on Chris Albon’s Python script). In this post, I go over a bit of code about what I …<\/p>\n","protected":false},"author":1,"featured_media":2094,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[132,116,44,1],"tags":[],"_links":{"self":[{"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/posts\/2093"}],"collection":[{"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/comments?post=2093"}],"version-history":[{"count":5,"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/posts\/2093\/revisions"}],"predecessor-version":[{"id":2099,"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/posts\/2093\/revisions\/2099"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/media\/2094"}],"wp:attachment":[{"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/media?parent=2093"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/categories?post=2093"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/tags?post=2093"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}