I have recently written a dashboard to explore the preference flows of votes during the 2024 QLD state elections. The dashboard is available here.

The dashboard has three tabs: 1. Overall preference flows, 2. Preference flows in each electorate, 3. A what-if scenario to test the impact of different levels of preference flows from the Greens to Labor.

In this first part of a four part series of posts, I will discuss how I generated the overall preference flow Sankey diagram and a corresponding table.

Getting the data

The election data is available from the QLD Electoral commission.

The first step is to get data about all the electorates. This is available in the form of a JSON file from the QEC website.

import requests
import json

electorates_url = (
    "https://resultsdata.elections.qld.gov.au/SGE2024-electorates.json"
)
resp = requests.get(electorates_url)

electorates = resp.json()["electorates"]

electorates[0]

{'electorateId': 893,
 'parentElectorateId': None,
 'electorateName': 'Algester',
 'stub': 'algester',
 'contestType': 'State',
 'electorateType': 'State',
 'votingMethod': 'Attendance Ballot',
 'votingSystem': 'Compulsory Preferential Voting',
 'votingSystemMayor': None,
 'votingSystemCouncillor': None,
 'indicativeCount': True,
 'candidatesCount': 6,
 'candidates': [{'ballotName': 'BOSLEY, Rhys',
   'ballotOrderNumber': 1,
   'candidateSurname': 'BOSLEY',
   'candidateGivenNames': 'RHYS ANTHONY',
   'party': 'Independent',
   'partyCode': 'IND'},
  {'ballotName': 'TURNER, Jane',
   'ballotOrderNumber': 2,
   'candidateSurname': 'TURNER',
   'candidateGivenNames': 'JANE LOUISE',
   'party': 'Family First Queensland',
   'partyCode': 'Family First'},
  {'ballotName': 'PRASAD, Jitendra',
   'ballotOrderNumber': 3,
   'candidateSurname': 'PRASAD',
   'candidateGivenNames': 'JITENDRA PRABHAKAR',
   'party': 'Liberal National Party of Queensland',
   'partyCode': 'LNP'},
  {'ballotName': 'ENOCH, Leeanne',
   'ballotOrderNumber': 4,
   'candidateSurname': 'ENOCH',
   'candidateGivenNames': 'LEEANNE MARGARET',
   'party': 'Australian Labor Party (State of Queensland)',
   'partyCode': 'Australian Labor Party'},
  {'ballotName': 'MARIS, George',
   'ballotOrderNumber': 5,
   'candidateSurname': 'MARIS',
   'candidateGivenNames': 'GEORGE',
   'party': "Pauline Hanson's One Nation Queensland Division",
   'partyCode': 'One Nation'},
  {'ballotName': 'WILDIN, Andrea',
   'ballotOrderNumber': 6,
   'candidateSurname': 'WILDIN',
   'candidateGivenNames': 'ANDREA SUSAN',
   'party': 'Queensland Greens',
   'partyCode': 'The Greens'}],
 'enrolment': 40236}

There’s lots of information about each electorate inlcuding the candidates and total enrolment. But we’re mostly interested in the electorate’s name and the code (stub) used by the QEC API.

electorate_names = [x["stub"] for x in electorates]

Preference count for each electorate

The first preference votes, preference flows and the final result are all available with a single API call. > https://resultsdata.elections.qld.gov.au/SGE2024-preference-count-district-{electorate_stub}.json

Let’s see what it looks like for the Aspley electorate.

from string import Template

prefs_url = Template(
    "https://resultsdata.elections.qld.gov.au/SGE2024-preference-count-district-$electorate.json"
)
electorate = "aspley"
res = requests.get(prefs_url.substitute(electorate=electorate))
result = res.json()

This json file has a lot of information. So let’s break this down into different chunks.

Primary votes

# primary votes
result["preferenceDistributionDetails"]["primary"]

[{'ballotName': 'HALL, Allan',
  'ballotOrderNumber': 1,
  'party': "Pauline Hanson's One Nation Queensland Division",
  'partyCode': 'One Nation',
  'primary': 1539},
 {'ballotName': 'CAPELL, Wayne',
  'ballotOrderNumber': 2,
  'party': 'Family First Queensland',
  'partyCode': 'Family First',
  'primary': 707},
 {'ballotName': 'MELLISH, Bart',
  'ballotOrderNumber': 3,
  'party': 'Australian Labor Party (State of Queensland)',
  'partyCode': 'Australian Labor Party',
  'primary': 13988},
 {'ballotName': 'COOPER, Amanda',
  'ballotOrderNumber': 4,
  'party': 'Liberal National Party of Queensland',
  'partyCode': 'LNP',
  'primary': 15696},
 {'ballotName': 'HAWKINS, Fiona',
  'ballotOrderNumber': 5,
  'party': 'Queensland Greens',
  'partyCode': 'The Greens',
  'primary': 3817}]

Final tally

This gives us the final two party/candidate tally.

result["candidates"]

[{'ballotName': 'MELLISH, Bart',
  'ballotOrderNumber': 3,
  'party': 'Australian Labor Party (State of Queensland)',
  'partyCode': 'Australian Labor Party',
  'count': 17889,
  'percentage': '50.04%'},
 {'ballotName': 'COOPER, Amanda',
  'ballotOrderNumber': 4,
  'party': 'Liberal National Party of Queensland',
  'partyCode': 'LNP',
  'count': 17858,
  'percentage': '49.96%'}]

Preference distributions

This contains information about each preference distribution: the excluded candidate, preferences distributed to each remaining candidate, and the new tally of the remaining candidates.

result["preferenceDistributionDetails"]["distributions"][0]

{'exclusion': 1,
 'excludedCandidate': 'CAPELL, Wayne',
 'excludedCandidateBallotOrder': 2,
 'excludedCandidateParty': 'Family First Queensland',
 'excludedCandidatePartyCode': 'Family First',
 'excludedCandidateVotes': 707,
 'exhausted': 0,
 'exhaustedPercentage': '0.00%',
 'totalExhausted': 0,
 'votesDistributed': 707,
 'votesRemainingInCount': 35747,
 'preferences': [{'ballotName': 'HALL, Allan',
   'ballotOrderNumber': 1,
   'party': "Pauline Hanson's One Nation Queensland Division",
   'partyCode': 'One Nation',
   'preferences': 302,
   'preferencesPercentage': '42.72%',
   'runningTotal': 1841},
  {'ballotName': 'MELLISH, Bart',
   'ballotOrderNumber': 3,
   'party': 'Australian Labor Party (State of Queensland)',
   'partyCode': 'Australian Labor Party',
   'preferences': 113,
   'preferencesPercentage': '15.98%',
   'runningTotal': 14101},
  {'ballotName': 'COOPER, Amanda',
   'ballotOrderNumber': 4,
   'party': 'Liberal National Party of Queensland',
   'partyCode': 'LNP',
   'preferences': 179,
   'preferencesPercentage': '25.32%',
   'runningTotal': 15875},
  {'ballotName': 'HAWKINS, Fiona',
   'ballotOrderNumber': 5,
   'party': 'Queensland Greens',
   'partyCode': 'The Greens',
   'preferences': 113,
   'preferencesPercentage': '15.98%',
   'runningTotal': 3930}]}

Downloading all data

Next, we want to dowload all the data and save that somewhere to avoid repeating the API calls.

Before doing that, lets also assign colours to the parties. These are based on the colours commonly associated with these parties. https://peo.gov.au/understand-our-parliament/your-questions-on-notice/questions/what-are-the-colours-of-the-australian-political-parties

I’ve assigned unique colours to other parties whose colours I’m not aware of. And we also need colours for independents.

colours = {
    "ALP": "#FF0000",
    "Family First": "#1CE6EF",
    "KAP": "#8B0000",
    "LCQP": "#788D66",
    "LNP": "#0000FF",
    "One Nation": "#FFA500",
    "The Greens": "#00FF00",
    "Animal Justice Party": "#885578",
    "Libertarians": "#FF34FF",
}

# For independent candidates
other_colours = [
    "#000000",
    "#FFFF00",
    "#FAD09F",
    "#FF8A9A",
    "#D157A0",
    "#FF4A46",
]

Based on the structure of the results we saw above and also the requirements for this project, we will create three dataframes focusing on the first preferences, final result, and the distributions.

We’ll use pandas, particularly the json_normalize and the DataFrame.from_records - based on the JSON structure provided by the API.

import pandas as pd

Let’s specify what keys we are interested in.

exclusion_cols = [
    "exclusion",
    "excludedCandidate",
    "excludedCandidatePartyCode",
    "excludedCandidateBallotOrder",
    "votesDistributed",
]
pref_cols = [
    "ballotName",
    "preferences",
    "preferencePercentage",
    "runningTotal",
    "partyCode",
    "ballotOrderNumber",
]

We’ll define a couple of methods to parse the data from the JSON.

The first method we’ll define is to parse the votes received by candidates in either the primary count or the final count. We have already seen the strucutre of this data above. One key difference between the primary and final counts is the keys used to denote the votes: primary and count respectively. We’ll get around this by rename the primary column to count (if it exists) in the dataframe.

def parse_candidate_votes(data):
    """Parse the votes received by each candidate in the `data`."""
    counts = pd.DataFrame.from_records(data, exclude=["party"]).rename(
        columns={
            "ballotName": "candidate",
            "ballotOrderNumber": "ballotOrder",
            "partyCode": "party",
            "primary": "count",  # rename 'primary' to 'count' if it exists
        }
    )
    # Now we have a 'count' column irrespective of the data

    # assign colours to the candidates based on their party
    counts["colour"] = counts["party"].map(colours)

    # independents without a party colour
    no_colour = counts["colour"].isna()
    counts.loc[no_colour, ["colour"]] = other_colours[: no_colour.sum()]
    return counts

The next method is to parse the preference distributions in each distribution round. This will take a list of distributions and return a dataframe.

def parse_distribution(data):
    distribution = pd.json_normalize(data, "preferences", exclusion_cols)

    # remove redundant columns
    distribution = distribution.loc[
        :, distribution.columns.isin(exclusion_cols + pref_cols)
    ]

    # rename columns
    distribution = distribution.rename(
        columns={
            "ballotName": "toCandidate",
            "ballotOrderNumber": "toBallotOrder",
            "partyCode": "toParty",
            "runningTotal": "toRunningTotal",
            "excludedCandidate": "fromCandidate",
            "excludedCandidatePartyCode": "fromParty",
            "excludedCandidateBallotOrder": "fromBallotOrder",
        }
    )
    return distribution

Now, we’re ready to download the data.

first_prefs, distributions, final_tallies = [], [], []
for electorate in [x["stub"] for x in electorates]:
    res = requests.get(prefs_url.substitute(electorate=electorate))
    result = res.json()

    # first preferences
    first_pref = parse_candidate_votes(
        result["preferenceDistributionDetails"]["primary"]
    )
    first_pref["electorate"] = electorate
    first_pref = first_pref.sort_values("count")
    first_prefs.append(first_pref)

    # final tally
    final = parse_candidate_votes(result["candidates"])
    final["electorate"] = electorate
    final = final.sort_values("count")
    final_tallies.append(final)

    # preference distribution
    data = result["preferenceDistributionDetails"]["distributions"]
    distribution = parse_distribution(data)
    distribution["electorate"] = electorate  # add electorate information
    distributions.append(distribution)

Tidy and save

Now that we’ve got the data, we can tidy it up and save for later use.

# we'll concatenate all electorate data into one dataframe for each data type
distribution = pd.concat(distributions).set_index("electorate")
first_pref = pd.concat(first_prefs).set_index("electorate")
final_tally = pd.concat(final_tallies)

# set appropriate data types for each dataframe
distribution = distribution.astype(
    {
        "toBallotOrder": "uint8",
        "preferences": "uint16",
        "toRunningTotal": "uint16",
        "exclusion": "uint8",
        "fromBallotOrder": "uint8",
        "votesDistributed": "uint16",
        "toParty": "category",
        "fromParty": "category",
    }
)

first_pref = first_pref.astype(
    {
        "ballotOrder": "uint8",
        "count": "uint16",
        "party": "category",
        "colour": "category",
    }
)

final_tally = final_tally.astype(
    {
        "ballotOrder": "uint8",
        "party": "category",
        "count": "uint16",
        "colour": "category",
    }
)

For making our life a bit easier later, let’s rename ‘Australian Labor Party’ to ‘ALP’.

distribution["toParty"] = distribution["toParty"].cat.rename_categories(
    {"Australian Labor Party": "ALP"}
)
distribution["fromParty"] = distribution["fromParty"].cat.rename_categories(
    {"Australian Labor Party": "ALP"}
)
final_tally["party"] = final_tally["party"].cat.rename_categories(
    {"Australian Labor Party": "ALP"}
)
first_pref["party"] = first_pref["party"].cat.rename_categories(
    {"Australian Labor Party": "ALP"}
)

One last thing: we’ll set ‘IND’ as the party value for independents - it is currently ’’.

IND = "Ind."

distribution["toParty"] = distribution["toParty"].cat.rename_categories(
    {"": IND}
)
distribution["fromParty"] = distribution["fromParty"].cat.rename_categories(
    {"": IND}
)

final_tally["party"] = final_tally["party"].cat.rename_categories({"": IND})
first_pref["party"] = first_pref["party"].cat.rename_categories({"": IND})

Now we can save the data to parquet files.

distribution.to_parquet(
    "data/qld_2024_distributions.parq", compression="brotli"
)
first_pref.to_parquet("data/qld_2024_first_prefs.parq", compression="brotli")
final_tally.to_parquet("data/qld_2024_final_tally.parq", compression="brotli")

distribution = pd.read_parquet("data/qld_2024_distributions.parq")
first_pref = pd.read_parquet("data/qld_2024_first_prefs.parq")
final_tally = pd.read_parquet("data/qld_2024_final_tally.parq")

Tabular Visualisation

First, let’s generate a pretty table using pandas. Note: we need matplotlib to use the background_gradient syle of pandas.

Some quick observations: - The most votes were transferred from the Greens and One Nation, and the LNP had one of the fewest transfers because they were almost always either 1st or 2nd.

Let’s look at the overall preference flows for the entire state. But first, I’ll ignore the independents (those with an empty party name).

distribution_party = distribution[
    (distribution["fromParty"] != IND) & (distribution["toParty"] != IND)
]

distribution_party.pivot_table(
    index=["fromParty"],
    columns=["toParty"],
    values="preferences",
    aggfunc="sum",
    margins=True,
    margins_name="Total",
    observed=True,
    fill_value=0,
).style.background_gradient(axis=1, cmap="BuPu")

toParty	Animal Justice Party	ALP	Family First	KAP	LCQP	LNP	Libertarians	One Nation	The Greens	Total
fromParty
Animal Justice Party	0	1654	592	0	999	1328	35	1383	3954	9945
ALP	0	0	0	15872	0	16167	0	3108	6912	42059
Family First	393	9923	0	669	3197	14244	53	21496	9489	59464
KAP	0	12871	0	0	0	25268	0	1475	229	39843
LCQP	0	19909	607	1317	0	12438	114	15557	10781	60723
LNP	0	8239	0	0	0	0	0	0	3030	11269
Libertarians	0	954	92	0	64	2176	0	606	1044	4936
One Nation	0	71986	0	13086	664	190359	593	0	23888	300576
The Greens	0	242729	680	2505	5846	67648	0	14919	0	334327
Total	393	368265	1971	33449	10770	329628	795	58544	59327	863142

Sankey diagram of preference flows

import holoviews as hv
from holoviews import dim, opts

hv.extension("bokeh")

The first step to generate a saneky diagram of preference flows is to get total flows from each party to every other party. For this, we’ll group the distribution dataframe by fromParty and toParty columns and sum all the preference flows for every such pair.

total_dist_flows = distribution_party.groupby(
    ["fromParty", "toParty"], observed=True
)[["preferences"]].sum()
total_dist_flows.head(10)

		preferences
fromParty	toParty
Animal Justice Party	ALP	1654
	Family First	592
	LCQP	999
	LNP	1328
	Libertarians	35
	One Nation	1383
	The Greens	3954
ALP	KAP	15872
	LNP	16167
	One Nation	3108

We can now calculate what percent of pref flows from a party to every other party out of the total preference flows from that party. We’ll do this for both incoming and outgoing flows. For example, what percentage of One Nation’s total preference distributions went to LNP, and what percent of LNP’s incoming preferences came from One Nation.

total_dist_flows["from_pct"] = (
    total_dist_flows["preferences"]  # each party-to-party distribution
    # divided by the total distributions from the 'fromParty'
    / total_dist_flows.groupby(["fromParty"], observed=True)[
        "preferences"
    ].transform("sum")
) * 100  # as a percentage

total_dist_flows["to_pct"] = (
    total_dist_flows["preferences"]  # each party-to-party distribution
    # divided by the total distributions to the 'toParty'
    / total_dist_flows.groupby(["toParty"], observed=True)[
        "preferences"
    ].transform("sum")
) * 100  # as a percentage

We’ll use the Sankey diagrams from Holoviews. holoviws.Sankey requires as input a list of tuples of the form (from, to, value). Here’s a slightly modified example from the official documentation:

    nodes = ["PhD", "Career Outside Science",  "Early Career Researcher", "Research Staff",
             "Permanent Research Staff",  "Professor",  "Non-Academic Research"]
    nodes = hv.Dataset(enumerate(nodes), 'index', 'label')
    edges = [
        (0, 1, 53), (0, 2, 47), (2, 6, 17), (2, 3, 30), (3, 1, 22.5), (3, 4, 3.5), (3, 6, 4.), (4, 5, 0.45)   
    ]
    
    value_dim = hv.Dimension('Percentage', unit='%')
    careers = hv.Sankey((edges, nodes), ['From', 'To'], vdims=value_dim)

In our case, we first need to assign unique codes to the ‘from’ and ‘to’ parties. We want to display each party in the ‘from’ and ‘to’ sides so that we can see both outgoing and incoming preference flows for each party. But Holoviews doesn’t support acyclic connections.

As a workaround, we’ll asign two unique codes for each party, for the ‘to’ and ‘from’ sections. Let’s start with the outgoing (fromParty) flows. For this, we’ll use the codes attribute of pandas.Categorical. The fromParty and toParty are already categoricals in our dataframe.

# codes for 'from' parties
total_dist_flows["from"] = total_dist_flows.index.get_level_values(
    "fromParty"
).codes

# unique codes for 'to' parties, starting after the max 'from' code
step = total_dist_flows["from"].max()
total_dist_flows["to"] = (
    total_dist_flows.index.get_level_values("toParty").codes + step
)

Now, we’ll generate the nodes dataset as in the above example.

total_dist_flows = total_dist_flows.reset_index()

from_codes = sorted(
    dict(zip(total_dist_flows["from"], total_dist_flows["fromParty"])).items()
)
to_codes = sorted(
    dict(zip(total_dist_flows["to"], total_dist_flows["toParty"])).items()
)

# merge the from/to_codes to get the dataset
nodes = hv.Dataset(from_codes + to_codes, "index", "party")

Now lets draw the first version of our sankey diagram. Note here that kdims refers to the key variables used by holoviews to index our data. And vdims are the variables that are shown in the result visualisation.

hv.Sankey(
    (total_dist_flows, nodes),
    kdims=["from", "to"],
    vdims=["preferences", "from_pct", "to_pct", "fromParty", "toParty"],
).opts(
    width=800,
    toolbar="above"
)

There are many ways in which we can improve the above plot. 1. The same parties in the from/to sides don’t have the same colours. We need to fix this. 2. We want to show party names instead of codes in the labels. 3. We want to make the hover tooltips better formatted.

We’ll do this below.

Let’s start with the colormap. We have already defined colours for each party in the beginning. Let’s used those colours and create our own colourmap.

cmap = {str(i[0]): colours[i[1]] for i in from_codes + to_codes}

When using the bokeh backend, Holoviews provides HoverTool to fine tune the tooltips. Let’s make use of that.

from bokeh.models import HoverTool

# we'll include from/from_pct and to/to_pct in the same line.
hover = HoverTool(
    tooltips=[
        ("From", "@fromParty (@from_pct %)"),
        ("To", "@toParty (@to_pct %)"),
        ("Transferred", "@preferences{0,0}"),
    ],
)

Putting everything together, we have a much more visually appealing and informative Sankey diagram.

hv.Sankey(
    (total_dist_flows, nodes),
    kdims=["from", "to"],
    vdims=["preferences", "from_pct", "to_pct", "fromParty", "toParty"],
).opts(
    labels="party",
    label_position="outer",
    edge_color=dim("to").str(),
    node_color=dim("index").str(),
    cmap=cmap,
    tools=[hover],
    toolbar="above",
    width=800
)

That’s the Saneky diagram from the dashboard. The source code for this and the rest of the dashboard is available in github here.

In the next posts in this series, I will explain how I created the other tabs in the dashboard.