Setting your data expectations - Data profiling and testing with the Great Expectations library & Databricks

Introduction - Why Data Quality matters and the Great Expectations library. Over the last decade or so, companies have been striving to make better use of their data. The use cases for such projects have generally fallen into two categories: improving operational efficiency or driving customer sales / behaviour. However, in order to utilise this data, it must first be piped from source systems (CRM, ordering, POS etc.) into somewhere with greater redundancy. [Read More]

The Birthday Paradox, Tidyverse to Pandas

Whilst programming is often portrayed as quite a serious endeavor (someone hacking into a bank, a security analyst saving the world, a financial quant spending 80 hours a week modelling interest rates etc…), there is a lot of fun and inventiveness to be had with it. One such example is solving puzzles by simulating potential outcomes to estimate the probability of an event occurring. The first person who comes to mind on this topic, at least for me, is David Robinson (@drob - https://twitter. [Read More]
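As a rough illustration of the simulation idea mentioned in this excerpt, here is a minimal sketch that estimates the birthday paradox probability using only the Python standard library. It is not the code from the post itself: the function names, the 365-day year with equally likely birthdays, and the default trial count are all assumptions made for this example.

```python
import random


def has_shared_birthday(group_size, rng):
    """Return True if at least two people in one random group share a birthday."""
    # Assumption: 365 equally likely birthdays, leap years ignored.
    birthdays = [rng.randrange(365) for _ in range(group_size)]
    return len(set(birthdays)) < group_size


def estimate_shared_birthday_probability(group_size, trials=10_000, seed=0):
    """Estimate the probability of a shared birthday by repeated simulation."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    hits = sum(has_shared_birthday(group_size, rng) for _ in range(trials))
    return hits / trials


def exact_shared_birthday_probability(group_size):
    """Closed-form probability, used to sanity-check the simulation."""
    p_all_distinct = 1.0
    for i in range(group_size):
        p_all_distinct *= (365 - i) / 365
    return 1 - p_all_distinct


if __name__ == "__main__":
    for n in (10, 23, 50):
        simulated = estimate_shared_birthday_probability(n)
        exact = exact_shared_birthday_probability(n)
        print(f"group of {n}: simulated {simulated:.3f}, exact {exact:.3f}")
```

Comparing the simulated figure against the closed-form value is a cheap way to check that the simulation is behaving before applying the same approach to puzzles with no tidy formula.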

Big Data London!

Anyone who’s read The Hitchhiker’s Guide to the Galaxy, or seen the film, will know the seminal quote “Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is.” Well, apparently space has a new competitor for sheer size, and that’s data! Every year a mass of people descend on the Olympia conference center in Hammersmith, London for the Big Data London (BDL) conference. This year I was lucky enough to attend as an exhibitor with The Oakland Group, a data-led consultancy that I currently work for. [Read More]

Building an ETL pipeline with Azure Batch

So, this is my first post in a long while, partly due to a busy summer and partly due to shifting jobs (which added to the busyness!). Since June I’ve been working as a data scientist / dev at a small, but rapidly growing, tech consultancy, which so far has led to me learning tonnes of new skills and working on some interesting projects. It’s because of this jump that I can actually write this post, as one project was the automation of a very manual, local process through the use of the Azure platform. [Read More]

Intro to Spark and building an AWS EMR cluster

This post has mainly been written for me, so that in the future I have a general reference guide for Spark and don’t have to piece together the various bits I’ve found strewn across the internet in order to build an EMR instance that I like. Also, I originally wanted to build this to test whether reticulate could be used to link in with Spark on an EMR (Elastic MapReduce) cluster. [Read More]

Football Modelling part 3 - Team predictions

This document will serve as the final part (part 3) of the Premier League analysis / modelling, and will be focussed on the player data collected in part one, which will then be aggregated into teams and used for predicting some old matches. As before, the first thing to do is load the required packages. For this piece the requirements should be filled by both the tidyverse, for cleaning / iterating, and rvest, to scrape some fixtures. [Read More]

Football Modelling part 2 - EDA and Modelling

This document will serve as part two, focussing on exploring the data collected in part one and building the final model to test on player data. As such, it will be focussed at the team level. As before, the first thing to do is load the required packages. For this piece the requirements should be filled by both the tidyverse and tidymodels packages, as well as the required model packages and ggrepel / gghighlight to add tweaks to charts. [Read More]

Football Modelling part 1 - Web scraping

This piece of work will focus on scraping football data and using it to build a statistical model to predict the outcomes of Premier League matches in England. It will comprise three parts, all aiming to use skills I have acquired over the past couple of years. Part one: Obtaining data on both current players (past three years) and teams for the last four seasons of matches [Read More]

Clustering comparison

Intro: This short post is an attempt to compare two commonly used clustering methods, hierarchical and k-means, utilising a wheat seeds data set from the online machine learning repository. Analysis: First thing, as always, is to load packages, with a mix of cleaning, clustering and graphing packages for this analysis. library(tidyverse) library(clustree) library(cluster) library(scales) library(gridExtra) library(devtools) library(gganimate) Now it’s time to load in the data set and tidy it up. [Read More]

The life of Sam Vimes - A Discworld NLP project

I’ve never been a huge reader, though when I do get into a book it’s usually part of a larger series (ASOIAF, Harry Potter, LoTR etc). By far my favorite series of books is the collection known as the Discworld novels, written by Sir Terry Pratchett. Between 1983 and 2015 he wrote 41 novels all set in a single universe, where a giant tortoise holding up a disc-shaped world roamed the skies. [Read More]