My friend and I have been chatting on Discord almost daily since high school. Over the years, we've exchanged about 48,000 messages. I started wondering what kinds of things we've talked about most and whether there were any patterns in our conversations. This post covers how I downloaded, processed, and visualized our entire chat history.
Exporting the Chat History
To start, I needed a way to export our messages from Discord. I looked into their API but didn't find an easy solution. Eventually, I discovered DiscordChatExporter, an open-source tool that allowed me to download all our chat history into 47 JSON files containing around 45,000 messages. It worked perfectly.
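If you want to follow along, here's roughly how the exported files can be pulled into a pandas DataFrame. This is a minimal sketch: the exports/ folder is an assumption, and the field names reflect DiscordChatExporter's JSON layout (a "messages" array with "content", "timestamp", and "author"), so adjust them to match your own export.
import glob
import json
import pandas as pd

# Combine every exported JSON file into one DataFrame of messages.
rows = []
for path in glob.glob("exports/*.json"):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for msg in data["messages"]:
        rows.append({
            "content": msg["content"],
            "author": msg["author"]["name"],
            "timestamp": msg["timestamp"],
        })

df = pd.DataFrame(rows)
df = df[df["content"].str.strip() != ""]  # drop empty or attachment-only messages
print(f"Loaded {len(df)} messages")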
Generating Message Embeddings
Once I had the data, I needed a way to represent each message numerically. I decided to use BERT embeddings from the sentence_transformers library. Each message was converted into a 768-dimensional vector, capturing its semantic meaning. Here's the code I used:
import os
import pickle

import numpy as np
from sentence_transformers import SentenceTransformer

def generate_bert_embeddings(df):
    """Generate or load BERT embeddings."""
    embeddings_file = 'bert_embeddings.pkl'
    if os.path.exists(embeddings_file):
        print("Loading BERT embeddings from cache...")
        with open(embeddings_file, 'rb') as f:
            embeddings = pickle.load(f)
    else:
        print("Generating BERT embeddings using SentenceTransformer...")
        model = SentenceTransformer('bert-base-nli-mean-tokens')
        embeddings = model.encode(df['content'].tolist(), show_progress_bar=True)
        with open(embeddings_file, 'wb') as f:
            pickle.dump(embeddings, f)
        print("BERT embeddings cached.")
    return np.asarray(embeddings)
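Calling it on the DataFrame from earlier gives one 768-dimensional vector per message, and the pickle cache means re-runs are fast:
# Computes the embeddings on the first run, loads them from cache afterwards.
embeddings = generate_bert_embeddings(df)
print(embeddings.shape)  # (number of messages, 768)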
Initial Clustering Attempts
To find patterns in our chats, I used clustering algorithms. My first attempt was with HDBSCAN, a density-based clustering algorithm that handles noisy data well. To visualize the high-dimensional embeddings (768 dimensions), I reduced them to two dimensions using UMAP. The results, however, were disappointing:

The clusters lacked meaningful structure, so I decided to try something else.
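For reference, that first pass looked roughly like this with the umap-learn and hdbscan packages. It's a minimal sketch: the min_cluster_size value is illustrative rather than the exact setting I used, and it assumes the clustering runs on the full 768-dimensional embeddings while UMAP is only used for the 2D plot.
import hdbscan
import umap

# Cluster the full 768-dimensional BERT embeddings with HDBSCAN.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric='euclidean')
cluster_labels = clusterer.fit_predict(embeddings)  # -1 marks noise points

# Reduce to two dimensions with UMAP for plotting.
reducer = umap.UMAP(n_components=2, random_state=42)
coords_2d = reducer.fit_transform(embeddings)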
Switching to K-Means and t-SNE
Next, I tried K-Means, a simpler and more predictable clustering algorithm. The algorithm works by:
- Randomly initializing K cluster centers.
- Assigning each point to the nearest cluster center.
- Recalculating the cluster centers based on the assignments.
- Repeating steps 2 and 3 until convergence.
For dimensionality reduction, I replaced UMAP with t-SNE, which preserves local relationships in the data. The combination of t-SNE and K-Means produced much better results:
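In code, that combination looks roughly like this with scikit-learn. The cluster count and perplexity below are placeholders rather than the exact values I used.
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# K-Means on the BERT embeddings; the number of clusters is a placeholder.
kmeans = KMeans(n_clusters=100, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(embeddings)

# t-SNE down to two dimensions for the visualization.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
coords_2d = tsne.fit_transform(embeddings)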

Moving to Gaussian Mixture Models (GMM)
Although K-Means improved the clusters, it assumes that each cluster is spherical, which isn't ideal for real-world data. To address this, I switched to Gaussian Mixture Models (GMMs). GMMs treat each cluster as a Gaussian distribution and use a probabilistic approach to assign points to clusters.
Mathematically, GMMs model the data as:
$$P(X_i) = \sum_{k=1}^K \pi_k \cdot \mathcal{N}(X_i \mid \mu_k, \Sigma_k)$$
where:
- \(\pi_k\): Weight of the \(k\)-th cluster.
- \(\mu_k\): Mean of the \(k\)-th cluster.
- \(\Sigma_k\): Covariance matrix of the \(k\)-th cluster.
To optimize these parameters, GMMs use the Expectation-Maximization (EM) algorithm:
- E-step: Estimate the probability of each point belonging to each cluster.
- M-step: Update \(\pi_k\), \(\mu_k\), and \(\Sigma_k\) to maximize the likelihood.
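Written out, the E-step computes a responsibility \(\gamma_{ik}\), the probability that point \(X_i\) belongs to cluster \(k\), and the M-step re-estimates the parameters from those responsibilities:
$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(X_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(X_i \mid \mu_j, \Sigma_j)}$$
$$N_k = \sum_{i=1}^{N} \gamma_{ik}, \qquad \pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} X_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} (X_i - \mu_k)(X_i - \mu_k)^\top$$
In my implementation below I only track diagonal covariances (a per-feature variance for each cluster), which keeps the updates element-wise and cheap even in high dimensions.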
import numpy as np
from scipy.special import logsumexp
from sklearn.cluster import KMeans

def cluster_documents_gmm(tfidf_matrix, n_components=100, max_iter=100, tol=1e-4):
    """Cluster documents using Gaussian Mixture Models."""
    X = tfidf_matrix.toarray()
    n_samples, n_features = X.shape

    # Initialize means with K-Means and start from a shared diagonal covariance.
    kmeans = KMeans(n_clusters=n_components, random_state=42).fit(X)
    means = kmeans.cluster_centers_
    covariances = np.tile(np.var(X, axis=0) + 1e-6, (n_components, 1))
    weights = np.full(n_components, 1 / n_components)

    log_likelihood = 0
    for iteration in range(max_iter):
        # E-step: log probability of each point under each diagonal Gaussian
        log_prob = np.zeros((n_samples, n_components))
        for k in range(n_components):
            diff = X - means[k]
            exponent = -0.5 * np.sum((diff ** 2) / covariances[k], axis=1)
            log_prob[:, k] = (np.log(weights[k] + 1e-10)
                              - 0.5 * np.sum(np.log(2 * np.pi * covariances[k]))
                              + exponent)
        log_prob_norm = logsumexp(log_prob, axis=1)
        responsibilities = np.exp(log_prob - log_prob_norm[:, np.newaxis])

        # M-step: update weights, means, and diagonal covariances
        Nk = responsibilities.sum(axis=0) + 1e-10
        weights = Nk / n_samples
        means = (responsibilities.T @ X) / Nk[:, np.newaxis]
        covariances = (responsibilities.T @ (X ** 2)) / Nk[:, np.newaxis] - means ** 2
        covariances = np.maximum(covariances, 1e-6)

        # Check for convergence
        new_log_likelihood = np.sum(log_prob_norm)
        if np.abs(new_log_likelihood - log_likelihood) < tol:
            break
        log_likelihood = new_log_likelihood

    labels = np.argmax(responsibilities, axis=1)
    return labels
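The function expects a sparse TF-IDF matrix (hence the toarray() call), so calling it might look like the sketch below. The vectorizer settings are illustrative, not the ones I used.
from sklearn.feature_extraction.text import TfidfVectorizer

# Build a TF-IDF matrix over the message texts and cluster it.
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df['content'].tolist())

labels = cluster_documents_gmm(tfidf_matrix, n_components=100)
df['cluster_label'] = labels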
The results with GMMs were the best so far, with more distinct and meaningful clusters:

Rendering Thousands of Points
To show all the messages at once, I needed a frontend that wouldn't slow down with tens of thousands of points. I used an HTML canvas and a bit of D3 for zooming and panning. Canvas is more efficient than creating individual SVG elements, so it runs smoothly even at this scale. I also created a color palette, then used interpolation to expand it to cover more clusters. After that, I wrote a small seeded random function to shuffle the color list, so that clusters sitting next to each other don't end up with nearly identical colors. I also added a highlight feature: clicking on a point dims all the other clusters, making it easier to focus on that one cluster alone. Finally, I added hover tooltips to show each message's text, author, and timestamp. This setup made it simple to explore the entire history without lag.
Here's a small snippet from script.js that shows how I set up the canvas and draw each point. I'm redrawing everything on every zoom or pan event, which might sound expensive, but in practice it performs well. Because I'm using canvas, drawing tens of thousands of points remains responsive.
const width = window.innerWidth;
const height = 800;
const pixelRatio = window.devicePixelRatio || 1;

const canvas = document.getElementById("chart");
canvas.width = width * pixelRatio;
canvas.height = height * pixelRatio;
canvas.style.width = width + "px";
canvas.style.height = height + "px";

const context = canvas.getContext("2d");
context.scale(pixelRatio, pixelRatio);

function drawPoints(data, transform) {
  context.save();
  context.clearRect(0, 0, width, height);
  context.translate(transform.x, transform.y);
  context.scale(transform.k, transform.k);
  data.forEach(d => {
    context.beginPath();
    context.arc(xScale(d.x), yScale(d.y), 3 / transform.k, 0, 2 * Math.PI);
    context.fillStyle = getColorForCluster(d.cluster_label);
    context.fill();
  });
  context.restore();
}

// Called inside a zoom handler:
d3.select(canvas).call(d3.zoom().on("zoom", event => {
  drawPoints(myData, event.transform);
}));
In this example, xScale and yScale are standard D3 linear scales, and getColorForCluster returns a color based on the cluster label. I also have a tooltip system that listens for mouse events on the canvas to figure out which point I'm hovering over. This way, I can click to highlight a cluster, zoom in to examine smaller groups, and explore all the messages with ease.
Cluster Highlights
With the canvas working, I had a lot of fun going through all the clusters. Some were straightforward, like "Research," which contained all our messages about research. Other clusters that caught my eye were "Bet," which I guess was a message we sent so frequently that it formed its own cluster, and "Join," which just contained messages from my friend telling me to join our weekly calls over and over.



Conclusion
This project was a great way to revisit years of conversations with a friend and explore clustering techniques. While the results aren't perfect, they reveal clear patterns in our topics. All the code is available on GitHub. There's still room for improvement, but this was a fun start.