Big Data London!

Anyone who’s read The Hitchhiker’s Guide to the Galaxy, or seen the film, will know the seminal quote: “Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is.” Well, apparently space has a new competitor for sheer size, and that’s data!

Every year a mass of people descends on the Olympia conference centre in Hammersmith, London for the Big Data London (BDL) conference. This year I was lucky enough to attend as an exhibitor with The Oakland Group, the data-led consultancy I currently work for. Over the course of two days I tried both to take in as much as possible of what’s moving in the data and analytics world and to chat to attendees in order to understand what issues they were having and how we might be able to help. Below I’ve summarised a few thoughts from the event, based on the talks, the exhibitors and those conversations!

1: Data quality and access are common blockers in getting the most from data!

The main reason for being at BDL was to chat with potential clients in order to understand what data-centric issues they were having, from overall strategy down to data quality. This allows us as a company to better understand the current market and determine what we can deliver to help solve those issues.

From these conversations, the most common issues were around actually being able to access data easily and ensuring the quality of that data. Digging a little deeper, the most likely culprits looked to be the isolation of data systems from the people trained to use the data (i.e. data held by IT whilst the analysts sit in other functions) and poor ETL processes. As such, there may be some hope for these companies in the wider push from on-prem to cloud infrastructure, which gives them the chance to reset permissions and processes for a more data-centric organisation!
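
To make the data quality side a little more concrete, below is a rough sketch of the kind of basic checks I’d encourage teams to run on an incoming table before trusting it downstream. The file, column names and metrics are purely illustrative.

```python
import pandas as pd

# Hypothetical extract from a sales system; file and column names are illustrative only.
df = pd.read_csv("sales_extract.csv", parse_dates=["order_date"])

checks = {
    # Completeness: what proportion of a key field is missing?
    "missing_customer_id": df["customer_id"].isna().mean(),
    # Validity: negative order values usually point to a broken upstream process.
    "negative_order_value": (df["order_value"] < 0).mean(),
    # Uniqueness: duplicate order IDs suggest a faulty ETL join or a double load.
    "duplicate_order_id": df["order_id"].duplicated().mean(),
    # Timeliness: stale data means the pipeline may have silently stopped.
    "days_since_last_order": (pd.Timestamp.now() - df["order_date"].max()).days,
}

for name, value in checks.items():
    print(f"{name}: {value:.3f}")
```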

2: There appears to be a trend towards code-free solutions?!

Of the 100+ vendors on show at BDL, a fair number were offering code-free tools (such as Alteryx or StreamSets), which allow companies to improve their outputs without hiring the more technically skilled staff usually required for dealing with complex data.

On the whole this took me by surprise, as my experience with such tools has always been less positive than expected (though I’m always going to be biased, given the flexibility I have using Python and R). However, as mentioned previously, I can see the merit for companies trying to rapidly upskill their use of data: pushing away from an Excel-based solution but not jumping into the deeper lake of full-on programming, allowing things to stay somewhat simpler!

One concern I do have around these types of tools is the ability to optimise processes and run production-scale jobs when you’re locked out of the underlying code. As such, some companies may find themselves building large processes within such tools, and then translating them into code at a later date.
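
To give a sense of what that translation might look like, here’s a minimal sketch of a typical join / filter / summarise workflow, the sort of thing usually built as a chain of nodes in a code-free tool, rewritten in pandas. The files and column names are made up for illustration.

```python
import pandas as pd

# Illustrative inputs; in a code-free tool these would be two input nodes.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
customers = pd.read_csv("customers.csv")

# Join node: attach customer attributes to each order.
merged = orders.merge(customers, on="customer_id", how="left")

# Filter node: keep only completed orders from 2019.
completed = merged[(merged["status"] == "completed") &
                   (merged["order_date"].dt.year == 2019)]

# Summarise node: total revenue per region, written out for a reporting tool.
summary = (completed.groupby("region", as_index=False)["order_value"]
                    .sum()
                    .rename(columns={"order_value": "total_revenue"}))
summary.to_csv("revenue_by_region.csv", index=False)
```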

3: Managed services are on the rise!

A couple of years ago AWS brought out SageMaker, a product that amalgamated a variety of their offerings into one. SageMaker was designed to give more power to data scientists by taking away some of the pain of developing cloud infrastructure and allowing them to focus on analysis and modelling. It does this by providing an analysis interface which can easily connect to AWS data sources and be configured to automatically stand up a model endpoint in order to serve a production-level model.
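
Roughly, the workflow looks something like the sketch below, using the SageMaker Python SDK. The training script, IAM role, instance types and S3 paths are all placeholders, and the exact argument names vary between SDK versions, so treat it as a flavour rather than a recipe.

```python
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder IAM role

# Training job: SageMaker provisions the instance, runs train.py and stores the model in S3.
estimator = SKLearn(
    entry_point="train.py",            # hypothetical training script
    role=role,
    instance_type="ml.m5.large",
    instance_count=1,
    framework_version="0.23-1",
    sagemaker_session=session,
)
estimator.fit({"train": "s3://my-bucket/training-data/"})  # placeholder S3 path

# Deployment: a single call turns the trained model into a hosted endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
print(predictor.predict([[0.5, 1.2, 3.4]]))
```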

A key thing I noticed at this year’s BDL, through both walking around and attending a few talks, is that these types of service appear to be on the rise. This was clear from the strong presence of Databricks, who provide a managed platform for making better use of the Spark big data infrastructure. In addition, there were a number of smaller companies offering to reduce the DevOps / architecture burden on data scientists, allowing them to focus on model development (such as Algorithmia and BDB).
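
The appeal of these platforms is that the cluster management largely disappears while the code stays plain Spark. As a rough illustration (with a made-up dataset and path):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On a managed platform such as Databricks a `spark` session already exists;
# running locally you build one yourself.
spark = SparkSession.builder.appName("events-summary").getOrCreate()

# Illustrative dataset: event logs stored as Parquet in cloud storage.
events = spark.read.parquet("s3://my-bucket/events/")

# The same DataFrame code runs on a laptop sample or a full cluster;
# the managed service's job is to make the cluster part painless.
daily_counts = (events
    .withColumn("day", F.to_date("event_timestamp"))
    .groupBy("day", "event_type")
    .count()
    .orderBy("day"))

daily_counts.show(10)
```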

The most obvious reminder of this rise in managed services came when attending a talk launching Microsoft Azure’s new product, Synapse. Synapse is set up in a similar way to SageMaker, allowing a data scientist to quickly develop models connected to various data sources. An additional feature, however, is the development of the SQL engine so that it can more easily query the file formats typically used for model outputs. As such, analysts relying on the outputs of those models will be able to access the underlying predictions faster.

Overall I feel like this is a step in the right direction if used properly (i.e. don’t just throw models out there), and whilst these services come at a price, they do allow those with more analytics than infrastructure experience to get up and running until they feel they can fly on their own!

4: Everyone loves AI….

Interestingly, this year’s BDL fell only a week after a large controversy within the data community, namely the discovery of the highly gender-biased AI used to determine credit limits for Apple’s new credit card (https://observer.com/2019/11/goldman-sachs-bias-detection-apple-card/). As such, it was interesting to contemplate how the mass of AI-based companies would deal with the expected questions from their potential customers around how their products handle bias.
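
For what it’s worth, even a crude screening check can surface that kind of problem before a model ships. Here’s a minimal sketch with a hypothetical decisions file and column names; it’s a first look, not a proper fairness audit.

```python
import pandas as pd

# Hypothetical model outputs: one row per applicant with the credit limit granted.
results = pd.read_csv("credit_decisions.csv")  # illustrative file and columns

# Compare how each group is treated by the model.
print(results.groupby("gender")["credit_limit"].agg(["mean", "median", "count"]))

# A crude disparity ratio: values well below 1.0 warrant a much closer look.
means = results.groupby("gender")["credit_limit"].mean()
print("disparity ratio:", means.min() / means.max())
```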

Aside from this, my main thought was around the sheer number of companies touting AI as the solution, fully cementing the data world in the AI hype. My personal feeling, based on my experience dealing with messy, incomplete data over the last few years, is that these types of solution are often unprepared to deal with the state of companies’ data. This is especially true for older or larger organisations which have gone through multiple process changes, altering how data is collected, maintained and represented.

As such, whilst I do feel the rise of useful AI is inevitable as companies’ use of data matures, for now the better approach would be to gain a greater understanding of their current position and properly plan how to achieve their analytics goals.