Analyzing Planet Golang traffic with Julia


As you might know, I’m running planetgolang.dev - an unopinionated Go news aggregator. Now that it’s been out for a while, I want to see some stats to check whether it’s actually useful to people!

It would stand to reason to do it in Go - after all, it’s a Go site - but lately I’ve been finding my way around Julia. Since I intend to use Julia mostly in data science contexts, this feels like a good fit for some learning!

The site is static, distributed with AWS CloudFront. I’ve set it to collect logs in an S3 bucket and just… left it there. Hey, I never promised I’m the most organized person in the world. Anyway, these logs will need some lovin' before we can use them.

The AWS format is a bit weird. It’s kinda-sorta-TSV, but each file begins with a preamble like this:

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields c-port time-to-first-byte x-edge-detailed-result-type sc-content-type sc-content-len sc-range-start sc-range-end

Only after that do the actual TSV values start. Another issue is that since August last year, it has generated a few files:

$ ls logs | wc -l
20664

So let’s concatenate them into one megafile first, then stuff it all into a data frame, and then have some fun with it!

First let’s write our megafile. It’ll hang out in logs/concatenated.csv. I figured out I wanted quoted fields only after having gone all search-and-replace on a header row, so we’ll just quote everything as we write it. Not great, but it’s not like I’m putting these scribbles in production somewhere. The rest is relatively straightforward: read all the files, skip the first two lines of each, massage the TSV into quoted CSV, close the output descriptor. Hey presto!

using Glob

# Wrap every field in double quotes. Two methods, so it accepts either a
# comma-separated string or an already-split vector of fields.
quote_csv(parts) = join(map(x -> "\"$x\"", parts), ",")
quote_csv(s::AbstractString) = quote_csv(split(s, ","))

headers = "date,time,x-edge-location,sc-bytes,c-ip,cs-method,cs(Host),cs-uri-stem,sc-status,cs(Referer),cs(User-Agent),cs-uri-query,cs(Cookie),x-edge-result-type,x-edge-request-id,x-host-header,cs-protocol,cs-bytes,time-taken,x-forwarded-for,ssl-protocol,ssl-cipher,x-edge-response-result-type,cs-protocol-version,fle-status,fle-encrypted-fields,c-port,time-to-first-byte,x-edge-detailed-result-type,sc-content-type,sc-content-len,sc-range-start,sc-range-end"

csv_output = open("logs/concatenated.csv", "w+")
write(csv_output, "$(quote_csv(headers))\n")

for file in glob("ELHTE4P8I823B*", "logs")
    open(file, "r") do io
        line = 0

        while !eof(io)
            s = readline(io)
            line += 1

            # The first two lines are the #Version/#Fields preamble.
            if line < 3
                continue
            end

            # Massage the tab-separated fields into a quoted CSV row.
            s = split(s, "\t")
            write(csv_output, "$(quote_csv(s))\n")
        end
    end
end

close(csv_output)
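
As an aside, that giant hard-coded headers string is really just the #Fields: preamble with spaces swapped for commas. If the field list ever changes, we could derive it from the preamble instead. A minimal sketch (fields_from_preamble is a name of my own invention, and it assumes Julia 1.8+ for chopprefix):

# Sketch: pull the header names out of the "#Fields: " preamble line
# instead of hard-coding them.
function fields_from_preamble(path)
    for line in eachline(path)
        startswith(line, "#Fields: ") && return split(chopprefix(line, "#Fields: "))
    end
    error("no #Fields: line found in $path")
end

headers = join(fields_from_preamble(first(glob("ELHTE4P8I823B*", "logs"))), ",")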

Now we’ll pack it all up into a neat little data frame so we can have our fun. Most of those columns are not really useful to us, and spilling some of them would be GDPR-bad. I could have done this cleanup while reading the original files, but I’m lazy, and data frames have pretty cool facilities for lazy people.

Let’s take a quick look at the data once we’re done reading it:

removed_columns = ["x-edge-location", "c-ip", "cs(Host)", "cs(Referer)", "cs-uri-query", "cs(Cookie)", "x-edge-result-type", "x-edge-request-id", "x-host-header", "cs-bytes", "time-taken", "x-forwarded-for", "ssl-protocol", "ssl-cipher", "x-edge-response-result-type", "cs-protocol-version", "fle-status", "fle-encrypted-fields", "c-port", "time-to-first-byte", "x-edge-detailed-result-type", "sc-content-len", "sc-range-start", "sc-range-end"]

using CSV, DataFrames

df = CSV.read("logs/concatenated.csv", DataFrame)

# select! already mutates df in place; no reassignment needed.
select!(df, Not(removed_columns))

show(sort(df, order(:date, rev=true)), allcols=true)
39404×9 DataFrame
   Row │ date        time      sc-bytes  cs-method  cs-uri-stem  sc-status  cs(User-Agent)                     cs-protocol  sc-content-type 
       │ Date        Time      Int64     String7    String       Int64      String                             String7      String31        
───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     1 │ 2022-04-21  00:08:44     11434  GET        /                  200  Expanse%20indexes%20customers%E2…  https        text/html
     2 │ 2022-04-21  00:09:08       581  GET        /                  301  Expanse%20indexes%20customers%E2…  http         text/html
     3 │ 2022-04-21  00:56:01       377  GET        /index.xml         304  Mozilla/5.0%20(compatible;%20Min…  https        -
     4 │ 2022-04-21  00:20:30      2256  GET        /index.xml         200  FreshRSS/1.19.2%20(Linux;%20http…  https        application/xml
     5 │ 2022-04-21  00:25:46      2657  GET        /index.xml         200  Mozilla/5.0%20(compatible;%20Min…  https        application/xml
     6 │ 2022-04-21  00:42:21     11444  GET        /                  200  Expanse%20indexes%20customers%E2…  https        text/html
     7 │ 2022-04-21  01:53:44      4107  GET        /                  200  Mozilla/5.0%20(Windows%20NT%206.…  https        text/html
     8 │ 2022-04-21  01:54:06      2655  GET        /index.xml         200  Mozilla/5.0%20(compatible;%20Min…  https        application/xml
     9 │ 2022-04-21  01:20:35     11435  GET        /                  200  Expanse%20indexes%20customers%E2…  https        text/html
    10 │ 2022-04-21  01:40:37      2269  GET        /index.xml         200  FreshRSS/1.19.2%20(Linux;%20http…  https        application/xml
    11 │ 2022-04-21  01:11:37       581  GET        /                  301  Expanse%20indexes%20customers%E2…  http         text/html
   ⋮   │     ⋮          ⋮         ⋮          ⋮           ⋮           ⋮                      ⋮                       ⋮              ⋮
 39395 │ 2021-08-31  22:47:08       590  GET        /kefu.php          301  Mozilla/5.0%20(Macintosh;%20Inte…  http         text/html
 39396 │ 2021-08-31  22:47:11       588  GET        /im/h5/            301  Mozilla/5.0%20(Macintosh;%20Inte…  http         text/html
 39397 │ 2021-08-31  22:47:11       590  GET        /info.php          301  Mozilla/5.0%20(Macintosh;%20Inte…  http         text/html
 39398 │ 2021-08-31  22:47:17         0  GET        /im/h5/              0  Mozilla/5.0%20(Macintosh;%20Inte…  https        -
 39399 │ 2021-08-31  22:47:13       582  GET        /                  301  Mozilla/5.0%20(Linux;%20Android%…  http         text/html
 39400 │ 2021-08-31  22:47:11         0  GET        /im/                 0  Mozilla/5.0%20(Macintosh;%20Inte…  https        -
 39401 │ 2021-08-31  22:47:19       591  GET        /mtja.html         301  Mozilla/5.0%20(Macintosh;%20Inte…  http         text/html
 39402 │ 2021-08-31  22:29:07       399  HEAD       /                  301  go-resty/2.3.0%20(https://github…  http         text/html
 39403 │ 2021-08-31  22:29:08       320  HEAD       /                  200  go-resty/2.3.0%20(https://github…  https        text/html
 39404 │ 2021-08-31  23:46:57       582  GET        /                  301  Mozilla/5.0%20(Macintosh;%20PPC%…  http         text/html
                                                                                                                          39383 rows omitted

Already some interesting URIs in there from back when the site was starting out… I suppose bots are gonna bot. Anyhow, now we can start checking through the data in earnest. First, as a quick litmus test, let’s make sure that most requests we get actually end up satisfied. The quickest yardstick is probably status codes. Let’s group by sc-status, count the rows and sort by the count.

status_group = groupby(df, ["sc-status"])
sort(
    combine(status_group, nrow => :count),
    [order(:count, rev = true)]
)

7×2 DataFrame
 Row │ sc-status  count
     │ Int64      Int64
─────┼──────────────────
   1 │       200  26249
   2 │       301   6508
   3 │       404   5033
   4 │       304   1464
   5 │       403     77
   6 │         0     69
   7 │       206      4

Well then! It seems that most of our requesters come away with what they intended to get. I’m wondering about both the 404 and 301 statuses, though. I think we ought to drill down into what URIs were actually requested there. We’ll do the exact same thing, but filter first.

show(
    sort(
        combine(
            groupby(filter("sc-status" => n -> n == 404, df), ["cs-uri-stem"]),
            nrow => :count
        ),
        [order(:count, rev = true)]
    ),
)
458×2 DataFrame
 Row │ cs-uri-stem                        count 
     │ String                             Int64 
─────┼──────────────────────────────────────────
   1 │ /robots.txt                         2867
   2 │ /favicon.ico                         350
   3 │ /wp-login.php                        303
   4 │ /favicon.svg                         162
   5 │ /apple-touch-icon.png                125
   6 │ /image.png                           102
   7 │ /ads.txt                              46
   8 │ /humans.txt                           35
   9 │ /.env                                 30
  10 │ /.git/config                          18
  11 │ //wp-includes/wlwmanifest.xml         14
  ⋮  │                 ⋮                    ⋮
 449 │ /login                                 1
 450 │ /login.php                             1
 451 │ /login.aspx                            1
 452 │ /login/                                1
 453 │ /Login                                 1
 454 │ /login.html                            1
 455 │ /login.asp                             1
 456 │ /.aws                                  1
 457 │ /new/license.txt                       1
 458 │ /plugins/elfinder/connectors/php…      1
                                437 rows omitted

Mixed bag, I suppose! Some completely legitimate requests - such as robots.txt, ads.txt, humans.txt and favicons, all of which should probably show up on my TODO list for Planet Golang. And then there’s the script-kiddie (or these days, more realistically, bot) stuff like looking for exposed credentials, login panels and WordPress installations.

What’s up with those 301s though?

show(
    sort(
        combine(
            groupby(filter("sc-status" => n -> n == 301, df), ["cs-uri-stem"]),
            nrow => :count
        ),
        [order(:count, rev = true)]
    ),
)
691×2 DataFrame
 Row │ cs-uri-stem                        count 
     │ String                             Int64 
─────┼──────────────────────────────────────────
   1 │ /                                   3991
   2 │ /robots.txt                          478
   3 │ /wp-login.php                        300
   4 │ /favicon.ico                         245
   5 │ /ads.txt                              45
   6 │ /what.html                            45
   7 │ /humans.txt                           35
   8 │ /.env                                 33
   9 │ /style.css                            24
  10 │ /index.xml                            23
  11 │ /1.html                               23
  ⋮  │                 ⋮                    ⋮
 682 │ /info.php/_profiler/phpinfo            1
 683 │ /assets/elfinder/src/connectors/…      1
 684 │ /old/license.txt                       1
 685 │ /assets/plugins/elfinder/src/con…      1
 686 │ /admin/elfinder/src/connectors/p…      1
 687 │ /server-status                         1
 688 │ /.aws                                  1
 689 │ /new/license.txt                       1
 690 │ /plugins/elfinder/connectors/php…      1
 691 │ /118.html                              1
                                670 rows omitted

Oh. It’s… more of the same. Well, live and learn, I suppose. (Side note - I’m checking the entirety of those data frames here, they’d just be super boring to list in full.)
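
Since that’s the same filter/group/count/sort pipeline twice now, a small helper would tidy things up if we keep digging. A quick sketch (count_by is a name I just made up, not anything from above):

# Hypothetical helper wrapping the repeated filter/group/count/sort pipeline.
function count_by(df, col; status = nothing)
    sub = status === nothing ? df : filter("sc-status" => ==(status), df)
    sort(combine(groupby(sub, [col]), nrow => :count), order(:count, rev = true))
end

show(count_by(df, "cs-uri-stem", status = 301))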

But these were all just warm-up stretches, right? We all know what I’m about. As a proponent of RSS, I wonder whether the HTML or the RSS version of the site is more popular! I also think we’ll extract the month from our date column and group on that, so we can see the trend over time. Without further ado, let’s add some columns to our DF.

using Dates
using DataFramesMeta  # provides @chain and @rtransform!

@chain df begin
    @rtransform! :month = Dates.format(:date, "yyyy-mm")
    @rtransform! :category = :"sc-content-type" == "text/html" ? "html" :
                             :"sc-content-type" == "application/xml" ? "rss" : "none"
end
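
That nested ternary is a bit of an eyesore, by the way. If you prefer, the mapping can live in a tiny helper instead (categorize is just a name I made up):

# Hypothetical alternative: pull the content-type mapping into a function.
categorize(ct) = ct == "text/html"       ? "html" :
                 ct == "application/xml" ? "rss"  : "none"

@rtransform! df :category = categorize(:"sc-content-type")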

month_category = sort(
        combine(
            groupby(
                filter(
                    "sc-status" => n -> n == 200, filter(
                        "category" => n -> n == "rss" || n == "html", df
                    )
                ), ["month", "category"]),
            nrow => :count
        ),
        [order(:month), order(:category)]
    )

show(month_category)
18×3 DataFrame
 Row │ month    category  count 
     │ String   String    Int64 
─────┼──────────────────────────
   1 │ 2021-08  html        192
   2 │ 2021-08  rss          10
   3 │ 2021-09  html       1221
   4 │ 2021-09  rss         469
   5 │ 2021-10  html       1433
   6 │ 2021-10  rss         593
   7 │ 2021-11  html       1605
   8 │ 2021-11  rss         647
   9 │ 2021-12  html       2113
  10 │ 2021-12  rss        1649
  11 │ 2022-01  html       2665
  12 │ 2022-01  rss        1241
  13 │ 2022-02  html       2544
  14 │ 2022-02  rss         998
  15 │ 2022-03  html       2795
  16 │ 2022-03  rss        1507
  17 │ 2022-04  html       1752
  18 │ 2022-04  rss        1223
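
If you’d rather eyeball html against rss side by side, DataFrames can also pivot this long table into a wide one:

# Pivot the long month/category table into one column per category.
show(unstack(month_category, :month, :category, :count))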

Good, that monthly breakdown is about what I was going for. And now, for my last trick… the same numbers, in plot form!

using StatsPlots

@df month_category groupedbar(:month, :count, group = :category, bar_position = :stack)

[Plot: a stacked bar chart of request counts per month, split into html and rss.]

Hmph. Well, no dice - it’s clear that HTML requests still dominate :) It is, however, heartening to see that people use the site and that it’s gaining traction!

As I mentioned on Planet Golang itself, “planets” are slowly becoming a thing of the past. It’s pretty much “curated this” and “algorithmic that” these days, and while I may not consider these the best ways to consume content, they are certainly prevalent. If you’re interested in an unopinionated, anything-goes feed for Golang, however, I certainly recommend visiting planetgolang.dev and checking it out for yourself.
