Transcript of "Q&A session with our data transformation experts"

ANA SANTOS: Hello, good afternoon. Let’s just give a bit of time for people to start joining and we’ll start in about two minutes. It’s kind of lunchtime now so- Afternoon everyone.

CHASEY DAVIES-WRIGLEY: People are maybe grabbing their lunch before they sign on.

ANA SANTOS: Yeah, that’s it.

Just give it a little bit, another minute or so and then we can start.

I was at a webinar yesterday and it actually took around three or four minutes for everyone to join because it just ran- Especially at lunchtime, you're running around, you're grabbing and then- [laughs] Then you sit down eating and join. So another minute or so and we'll start, because it's a short one today, half an hour, and we want to make sure we go through all the questions.

Hopefully you can see my screen okay?

CHASEY DAVIES-WRIGLEY: Yeah.

ANA SANTOS: Yeah, perfect.

Okay, I think we'll start now and then people will join us and can ask questions. So welcome everyone, this is our Data Q&A webinar with our data experts at Made Tech. Just a few operational things about the webinar: we're recording it, and we'll share the recording after the webinar for people that are here but also for anyone that might not be able to join today. We will go through a series of questions that our team have had before in different kinds of webinars and talks, but also general questions they often get in their work, and then at the end we'll open up for any questions that you in the audience might have.

In terms of how you submit them, so there should be at the bottom- You should have a Q&A box, so feel free to submit any questions there. If you have any comments or you just want to say hi, feel free to use the chat box as well, and I'll get back to it at the end, in the last 10 minutes or so. So I hope that's okay, and I'll get things moving.

Hi Deborah, I can see your message there.

Right, so who are we first before we go into the detail, data detail. So I’m Ana, I’m the Events Lead for Made Tech and I’m here hosting this discussion, this Q&A with our experts. Chasey, do you want to introduce yourself?

CHASEY DAVIES-WRIGLEY: Yeah, hi everyone, my name's Chasey Davies-Wrigley. I'm a Principal Data Engineer at Made Tech, a technology leader with a highly technical background including an MSE in computer science, and I've got over 22 years' experience across both private and public sectors; and I've been responsible for distributed, multidisciplinary teams building, architecting, delivering and maintaining highly reliable and scalable systems across a whole variety of platforms and technologies.

But here at Made Tech I’m focused on empowering the public sector to deliver and continuously improve digital services that are user-centric, data-driven and free from legacy technology. Before I hand over to Catherine I’ll just say for anyone that’s watching the recording, just reach out to me on LinkedIn as well if you’ve got any questions, I’ll happily connect and answer any questions via that. So, do you want me to hand over to Catherine?

CATHERINE QUILLEY: Thank you.

ANA SANTOS: Yeah.

CATHERINE QUILLEY: Hi, I'm Catherine Quilley, I'm a Senior Data Engineer here at Made Tech. I've been with Made Tech for getting towards a year and a half now, and I have a decade of experience in technology, mostly in finance, but I also used to be a business analyst before I became a data engineer. I enjoy building data pipelines for our clients but also upskilling and training, that's what motivates me; and you'll find me on projects partnering with people like James to help automate and scale data platforms so we can get useful and interesting insights from data. I'm going to pass over to James now.

JAMES POULTEN: Thank you Cat, yeah, hello everybody, good afternoon. My name's James, I am a Data Scientist here at Made Tech, the lead data scientist in fact. I only have five years in technology but in that time I've been consulting all through local and central government, including some time with the Number 10 Data Science Group, the Cabinet Office, the Levelling Up Taskforce, DCMS, the Met Office, and the Home Office, including Digital Passenger Services. So I've been around helping them implement data science in general, from facial recognition software through to regression trees, to consulting on this new fancy generative AI that everyone's hearing about.

ANA SANTOS: Thank you, sounds good. So, yeah, we'll move on to some frequently asked questions that all of you have had before in your talks, in your jobs, and in other data environments. So I'll start with Chasey. This question we actually ran a LinkedIn poll on this past week, and you can see I've put a bit of a screenshot in there of the results. So what do you think is the main reason why a lot of public sector data remains unutilised? There's a lot in this question around skills, around tools, around ways of working. We can see from the LinkedIn poll a lot of people think it's ways of working in the public sector, and we got a few comments as well, but if you want to go through it, Chasey, and give us your thoughts?

CHASEY DAVIES-WRIGLEY: Yeah, I'll just check I'm not on mute- Yeah. Yeah, so, as you can see from the poll, really it's not possible to just point to one reason; it's more complicated than just saying, train your people up, or just invest in new tooling, or you should adopt more agile ways of working. But that said, like you can see with the poll, ways of working is certainly high on the list of reasons, because civil servants are under immense pressure to deliver, often to unrealistic deadlines, and budget cuts mean that they're often doing multiple roles at the same time. During this poll we had a great quote from Andrew Ferguson, who's lead data scientist at the Office for National Statistics, and he said, "To paint a picture, it's incredibly challenging to build a life raft while you're drowning." I thought that was a great quote because it really sums up what I hear a lot from civil servants. The regulations around data usage are strict, and using new tools or setting up new ways of working often requires going through so many layers of bureaucracy, which takes time, energy and also accountability. And if something doesn't work, or if they advocate for the wrong tool, by the time they get approval to go forward and actually use it, it's already outdated, or they risk being seen as the person who made a mistake, and that can have negative career implications for them.

But when it comes to tooling, a common issue is knowing what data they have and where to find it, and this was also quite peppered through the comments on the poll. It sounds obvious but I'd say this is a great starting point, and on a positive note I'm seeing more government departments investing in things like data catalogue technology. I've seen situations in large, central government departments where one team have collated a vast amount of data in a very specific area, and another team in the same department have not actually known about that, and have then gone out and commissioned and paid a third party to collate the same data.

So tooling can really help with data discoverability, like using data catalogues, but equally you also need the skills to build that in the first place and maintain it going forward. And then there's training for users on how to use it. Tools like data catalogues also rely on the data being meaningfully tagged so that people can find what they need, and training's required not just on how to use the catalogue but on how to use it as part of your everyday role. So it really also involves evangelising about its value and broadcasting its availability because, if you think about it, a floor will never be brushed if no one knows the broom is in the cupboard.

ANA SANTOS: Makes sense, thank you Chasey. I should say as well, if you have any comments or any questions on top of this for Chasey, feel free to use the Q&A throughout and we'll address them as we go along or at the end. But, yeah, it was really interesting to read that poll and the comments that came through, because it's something that we see a lot in our projects with the public sector and, yeah, it was good to see there's a lot of the same thought, which means there's a lot of people that are willing to kind of change and make it better, so it's quite good.

The other question we had for Chasey is, do you think the public sector currently has the right skills to implement real change based on accurate data; and how do you see suppliers like yourself – this is from a question, so like us in this case – helping with this? So what happens when the contract finishes, we do what we had to do and then we have to leave; how do you see that happening?

CHASEY DAVIES-WRIGLEY: So across government departments the levels of skills vary considerably, but on the whole civil service data engineering teams tend to be under-resourced and overworked, so it leaves little opportunity for them to have the head space or the time to upskill whilst still delivering against the aggressive existing targets that they have to meet. The role of organisations like us, like Made Tech, is that we can bring new ways of working, pooling ideas and examples of what has already worked well in other departments.

So we can be a conduit for sharing best practice and experiences across departments, but we're also mindful to work in the open; bringing the civil servant engineers on the development journey with us is important, empowering them in decision-making processes and ensuring that we also play the role of mentors and upskill them as we go. It's really great to see confidence and skillsets growing, and people regaining their passion for what they're doing, and that's why we often work in blended teams, not just going into a department, building a shiny new thing and then leaving; it's really vital that we share as much as we can with those around us all the time.

ANA SANTOS: Yeah, definitely improving their skills as we go along. I was at an event with James yesterday and one sentence he said really resonated. James was saying, we try to work ourselves out of a job, wasn't it, James? So, you know, we try to work with these departments and give them the skills so that when we leave everything is set and we don't actually have to come back. Yeah, sounds good.

So, for Catherine as well, there was a question around, from your experience with public sector data teams, how long does it take to set up a data catalogue with an average-sized dataset, and what are the main benefits you hear after it's been set up?

CATHERINE QUILLEY: Yeah, so I mean timeframe is always a difficult thing and there can be a few factors that influence how long it takes to get a data catalogue set up, including how much data there is, how big those datasets are, and the locations of the data and, you know, even the term average-sized dataset will mean different things to different organisations. But it’s also worth noting that within a given organisation, some of these questions can be hard to answer which is exactly why you need something like a data catalogue in the first place. The catalogues are an efficient way to collect a census of your data as well as letting users find your data.

But let's assume that we just want to get a basic data catalogue implemented, to start to understand the data landscape better and see how people respond to it; that could be fairly quick, and on previous projects we had a working version set up for testing within a few weeks. As we've learnt from previous projects, the benefits are that data is so much more easily discoverable in a centralised place, you have much better search functionality, people can add documentation to data and identify opportunities to extract extra value from existing datasets, people begin to share data more readily and more regularly, and people spend less time manually updating documentation and more time getting insights from their data.
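[To make the catalogue idea concrete, here is a minimal sketch of what an entry and a simple search could look like; this is illustrative Python, not any specific catalogue product's API, and all the dataset names are made up.]

```python
from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    """One record in a minimal data catalogue (illustrative only)."""
    name: str
    location: str                       # e.g. an S3 path or a database table
    owner: str
    description: str = ""
    tags: list[str] = field(default_factory=list)

def search(catalogue: list[CatalogueEntry], term: str) -> list[CatalogueEntry]:
    """Naive search across names, descriptions and tags."""
    term = term.lower()
    return [
        entry for entry in catalogue
        if term in entry.name.lower()
        or term in entry.description.lower()
        or any(term in tag.lower() for tag in entry.tags)
    ]

catalogue = [
    CatalogueEntry("housing_stock", "s3://data/housing.parquet", "Housing team",
                   "Monthly snapshot of housing stock", ["housing", "monthly"]),
    CatalogueEntry("repairs_log", "warehouse.repairs", "Repairs team",
                   "Raw repair requests", ["housing", "raw"]),
]
print([entry.name for entry in search(catalogue, "housing")])  # both entries match
```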

ANA SANTOS: Makes sense, thank you. Another question we had for you was, so if you have a lot of different datasets and want to gain insight from them, what approach would you recommend?

CATHERINE QUILLEY: So I tried to come up with a sort of memorable, catchy way of describing this, and I came up with pinpoint, profile, pipe and present. So going back to this idea of a data catalogue, creating a record of where data is and gathering a census of your data landscape in the form of metadata can tell you a lot. And often when we use the term metadata, people assume we're talking about things like file names, file types, is it a database, is it an Excel file? But actually metadata comes from people a lot, and often the best source of metadata is the people who use the data day in and day out. They'll often have detailed knowledge about the datasets and how they use them, and often that knowledge is implicit, so it's passed on by word of mouth or through emails or manual documentation. Getting that knowledge documented in the right tool begins to break down silos and other issues that we see within organisations.

Then when you have that, you can begin to profile the data, making assessments about its shape and its quality: what values are in there, what's missing, how does this data connect with other data sources? That becomes really important because often you then find that there are a whole bunch of workflows and processes going on that have been normalised and not documented. You can also identify where data is duplicated or no longer needed, and once you've done that you can begin the process of piping.
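[As a rough sketch of that profiling step, a few lines of pandas can surface a dataset's shape, missing values and duplicates; the file name here is hypothetical.]

```python
import pandas as pd

df = pd.read_csv("repairs_log.csv")  # hypothetical dataset

print(df.shape)     # how many rows and columns
print(df.dtypes)    # what types of values are in there
# Share of missing values per column, worst first
print(df.isna().mean().sort_values(ascending=False))
print(df.duplicated().sum())  # fully duplicated rows
# Quick look at value ranges and frequent values
print(df.describe(include="all").T)
```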

So that's where you formalise the storage and processing of data, preferably in an automated and centralised way. So imagine the scenario where you've got someone who is processing several CSVs into one so it can be emailed to a data scientist to process and display in a dashboard. We automate that process into a pipeline to ensure the data is consistent, traceable and robust, and then it allows the people who process data to do what they do best, which is asking questions, finding stories, information and insights from data, or identifying additional data sources, rather than constantly cleaning and processing data. This kind of unlocks scalability and paves the way for expanding data sources, processes and analysis in a more controlled environment.
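[A minimal sketch of that piping step, replacing the manual merge-and-email of several CSVs with one repeatable function; the paths are illustrative, and a real pipeline would add validation and scheduling.]

```python
from pathlib import Path
import pandas as pd

def build_combined_extract(source_dir: str, out_path: str) -> pd.DataFrame:
    """Combine every CSV in source_dir into one consistent, traceable file."""
    frames = []
    for csv_file in sorted(Path(source_dir).glob("*.csv")):
        df = pd.read_csv(csv_file)
        # Consistency: normalise column names across source files
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        # Traceability: record where each row came from
        df["source_file"] = csv_file.name
        frames.append(df)
    combined = pd.concat(frames, ignore_index=True).drop_duplicates()
    combined.to_parquet(out_path, index=False)
    return combined

# Run on a schedule (cron, an orchestrator, etc.) instead of by hand:
build_combined_extract("incoming_csvs/", "warehouse/combined.parquet")
```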

And then finally we get to presenting data. Often we think that just means putting it in a dashboard and having some graphs, and that is part of the process, but it's also giving people a mechanism to query data and be curious about it and ask questions on an ad hoc basis. So often the data is around but it's hard to access or query; presenting it in a way where people can ask questions removes a barrier, and it means you can get more from your data with less effort. So it's quite a complex answer but, like I said, pinpoint, profile, pipe and present.
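[For the presenting step, giving people a way to ask ad hoc questions can be as light as SQL over the files the pipeline produces; DuckDB is one illustrative option, and the path matches the pipeline sketch above.]

```python
import duckdb

# An ad hoc question against the pipeline's output, no dashboard required
result = duckdb.query("""
    SELECT source_file, COUNT(*) AS row_count
    FROM 'warehouse/combined.parquet'
    GROUP BY source_file
    ORDER BY row_count DESC
""").df()
print(result)
```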

ANA SANTOS: You made it sound easy at the beginning and then you went into more detail. [laughs] Thank you Catherine.

So I think another question for you here, and this question is because Catherine recently gave a talk around, is there such a thing as too much data? You went into quite a lot of detail in that talk, and from that the person wondered if there might actually be too much data, is there any case where that happens? Do you have examples of what that can look like and how to go about organising it, to get rid of it maybe or to actually make sense of it?

CATHERINE QUILLEY: Yeah, so I can't really think of a situation where there's too much data, because with the right tools and processes most scales of data can be manageable. I mean, we only need to look at how much data gets generated on the internet every day to understand that different scales can be managed. I think the tricky part is often knowing what tools are right for where your data capability is at. So particularly if you're a public sector organisation, a really good starting point is to see where you feel you fit on the government data maturity assessment framework. Doing this can help you begin to break down the challenges of evolving your capability and managing data into pieces and understand what the next level looks like, and therefore it can help you set realistic goals and, crucially, ensure that you don't try and do everything at once, which can feel like a lot and can give the impression of too much data and too much to do.

Then, in turn, you can partner that with an understanding of how your team's structure and skillset can be used or adapted to meet those key milestones. And because there's this shared framework across departments, it's a really nice opportunity for you to identify other organisations or departments who are maybe one level up from you and speak to them about their experience.

And then to answer the part about getting rid of data, I don't think it should be a primary focus, and I think it's crucial to say a smaller data landscape isn't always a simpler one, particularly if your data is scattered or opaque. But structuring your strategy correctly usually leads to consolidation as you identify stale, unused or duplicate data; and it could even be considered an indicator that you're heading in the right direction in terms of data maturity. So I don't think your aim should be to get rid of data, but I think you can monitor how you're dealing with data in order to use it as a metric of success.

ANA SANTOS: Thank you Catherine, thanks for that. So, we’ll move on. James. So James recently gave another talk, not the one yesterday, but another talk around synthetic data and his experience with it as a data scientist. So, question for James is, why is the public sector not using synthetic data more widely in different projects and what do you think are the barriers?

JAMES POULTEN: Thank you Ana, yes, people are going to get sick of hearing from me. So, why is the public sector not using synthetic data more? The simplest answer is because it's complicated and difficult to set up. If you're going to create a really robust synthetic dataset then you're going to be utilising a lot of very complicated data science techniques, modelling multivariate relationships between the different features of your dataset. So it takes a team of experts to build out a synthetic dataset. It's also quite a new technology, or it's become more widely adopted more recently; I think the actual mathematics behind it is some 30 years old now, but the fact that it's in a research paper doesn't necessarily mean that it's widely available to be used in the public or private sector. Another issue that we often encounter is data quality. The idea behind a synthetic dataset is building a simulacrum – I think that's the right word – of existing data so that you can use and exploit the data in a safe way without having to worry about GDPR concerns or privacy concerns. If you have low quality data to begin with, you don't know what your synthetic data should look like, so you basically end up guessing, and you lose a lot of the power of synthetic data, which again is held in those multivariate relationships that are not obvious to someone just looking at the data in tabular form.
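[As a deliberately simplified illustration of what modelling those multivariate relationships means: the sketch below fits a mean and covariance to real numeric columns and samples synthetic rows that preserve the pairwise correlations. Real projects use much richer models, such as copulas or GANs, but the principle is similar; the file name is hypothetical.]

```python
import numpy as np
import pandas as pd

real = pd.read_csv("sensitive_numeric_data.csv")  # hypothetical, numeric columns only

# Learn the joint structure: column means plus the covariance matrix,
# which captures the (linear) relationships between features.
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

rng = np.random.default_rng(42)
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=len(real)),
    columns=real.columns,
)

# No synthetic row corresponds to a real record, but the correlations survive:
print(real.corr().round(2))
print(synthetic.corr().round(2))
```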

Yeah, so I think those are the two main things. Obviously there's also a lack of resource around data teams; data is the hot topic, and every team that I have interacted with has been inundated with requests from various officials or ministers for new pieces of analysis, new graphs, new dashboards, new models, and as I say this is a complex process which in itself is a whole data project on its own. So a lack of resource for building these things is also an issue.

ANA SANTOS: Thank you, makes sense. Do you have any examples of projects with synthetic data? I know we can't disclose some of it but, yeah, are there any examples you can give?

JAMES POULTEN: So, yeah, as I was saying, part of the whole point of synthetic data is to create data that doesn't hold any personally identifiable information, or information that isn't able to be shared. So as you can imagine, the people that you build synthetic data for don't like you talking about building synthetic data for them, so I'm not going to dive into where we may or may not be doing that. But it's safe to say that there are a number of projects where we are looking at how we can go about building these datasets for our clients, doing the data analysis that sets us on the way, making sure that we have a high enough quality and quantity of data, and trying to work out what techniques to use.

ANA SANTOS: Thank you. I think the last question for you James, and it's related to the skills we were talking about. So, and I think you kind of alluded to this a little bit in your first answer, but is working with synthetic data exclusive to data scientists, or are there opportunities for other data professionals to gain these skills?

JAMES POULTEN: There are always opportunities for other data professionals to gain these skills, and actually that's something we're seeing with the rise of generative AI models, AI models that can create outputs from a prompt. It's giving access to a wider range of professionals; you don't necessarily need to understand how a stable diffusion model works, or how a GAN, a generative adversarial network, works, or the architecture behind it. You can, in theory, pass your data into a large language model and ask it to create a set of synthetic data, and these models can do that. But please don't just put all your data into ChatGPT. That is a very bad idea, because these models can take your data, your precious, sensitive data, and use it to train future models, and you can have things like data leakage and some serious security breaches. So don't do it.

However, if you were to host your own model in your own secure cloud environment, all adhering to UK data protection laws, I could see non-data scientists building synthetic datasets. At the moment I think it is still largely restricted to data scientists, because to actually evaluate the models and make sure that the data they produce has all of those complex relationships, you have to go through a whole bunch of stats and mathematics, which tends to be the realm of data science because no one else is weird enough to enjoy doing that stuff. But, yeah, it's evolving day by day.
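[One simple flavour of the evaluation James describes, checking that a synthetic dataset preserves the statistical structure of the original, is comparing per-column distributions and the correlation matrices; this sketch assumes both datasets are numeric with matching columns.]

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def evaluate_synthetic(real: pd.DataFrame, synthetic: pd.DataFrame) -> None:
    """Crude fidelity checks: marginal distributions and pairwise correlations."""
    for col in real.columns:
        stat, _ = ks_2samp(real[col], synthetic[col])  # do the marginals match?
        print(f"{col}: KS statistic = {stat:.3f} (smaller is closer)")
    corr_gap = (real.corr() - synthetic.corr()).abs().max().max()
    print(f"Largest pairwise correlation gap: {corr_gap:.3f}")
```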

ANA SANTOS: Thank you James. Finally, and I know we're running out of time, there's about four minutes left, so if you have any questions in the audience please post them in the Q&A because we'll go through them right at the end. There was a bonus question that we got from someone that registered that I thought we would go through before passing on to questions from the audience, and the question was, I'm interested in the implementation of data catalogues in different government departments. So any stories or experiences around that would be great, and I think Catherine, you're taking this one?

CATHERINE QUILLEY: Yeah, sure, so Made Tech have implemented data catalogues in different contexts, and two notable projects that I think are good to talk about are Hackney Council and DCMS, the Department for Culture, Media and Sport. These are quite interesting projects to talk about because the need for a data catalogue was motivated by slightly different use cases. The data catalogue at Hackney was part of building a data platform. Lisa [ph Stidle 00:27:04] at Hackney put it really nicely when she described one of the motivations being to democratise data, and here the data catalogue was predominantly there to show analysts what was out there and available to use to drive additional insights. So primarily it was data discovery, but crucially of datasets that were essentially almost ready to be used, and were usually processed. I think the key point there is, you can implement a data catalogue and you don't have to put every single bit of data on it if you don't want to. You can tailor it to your requirements and business use case.

DCMS's business case was slightly different. They're embarking on a strategy to evolve their data capability, and they wanted to identify ways to tackle problems like a lack of visibility of data, siloization, and data being stored in different locations in different ways. Working with them, we were testing to see if a data catalogue was a suitable tool to tackle some or all of those issues. Not only were we testing the tools, but we were testing the concepts and ideas with DCMS. We were seeing how users would respond to a data catalogue being part of their working process: would they lean back on tried and tested ways of finding and discovering data; what conditions would need to be met to enable the adoption of something like a data catalogue that introduces a new way of working and thinking about your data?

And DCMS saw the immediate benefit in terms of being able to identify datasets. Knowledge that they might have maintained separately from the data sources could now be linked together, and they also saw that a data catalogue allows them to organise and group data in a way that's meaningful to them, for example by project or by team or by topic, but crucially it doesn't depend on them physically moving their data to achieve that. So it allowed them to be flexible and adaptable and to tackle one problem at a time: identifying what data sources they had, identifying those silos, grouping data together in a way that's sensible, but not having to do the physical lift and shift of the data when they hadn't quite decided what would be the best way to structure and store it.

And then finally, they saw how these kinds of tools can help them inform data strategy, by not only giving them this bird's eye view of their data landscape but also letting them see how users were interacting with the catalogue. What were they searching for, how were they tagging data? This can be really useful to identify what is missing from your data landscape or what user demand there is, demand that people don't necessarily always articulate if you ask them directly.

So I think the crucial thing or takeaway is that in both of these circumstances we've got two organisations that are at different stages in their data journey, but a data catalogue was incredibly useful across both. So I think data catalogues have longevity: you can have them at the beginning of your data capability journey and also all the way through. And I think the other interesting thing for Hackney and DCMS is that it was exactly the same tool that they were using, which shows how adaptable the tool was. So we implemented a bespoke version of DataHub, and it worked for DCMS and it worked for Hackney, because there was enough flexibility, there was enough functionality put in the hands of end users to make it work.
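[Since DataHub comes up here: registering a dataset with it programmatically looks roughly like the sketch below, based on DataHub's Python emitter. The server address, platform and dataset names are illustrative, and the exact API can vary between DataHub versions.]

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # your DataHub instance

# Register one dataset, with a description and grouping info as custom properties
dataset_urn = make_dataset_urn(platform="s3", name="projects/housing/combined", env="PROD")
properties = DatasetPropertiesClass(
    description="Combined housing extract, refreshed daily",
    customProperties={"team": "Housing", "project": "stock-analysis"},
)
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties))
```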

ANA SANTOS: Thank you very much Catherine. We're at the end of our time. We have one question in the Q&A box from Omid, thank you. Could you please provide insights into whether Made Tech is considering extending its services to African public sectors; if so, what might it look like, and how do we envisage leveraging expertise to support these regions in their digital transformation journey, in terms of modernising legacy technology, accelerating digital services and enabling data-driven decision making? I don't believe this is in our plans, but I don't know if Chasey or James or Catherine have any other comments?

CHASEY DAVIES-WRIGLEY: Yeah, I’ve not seen any talk about this, so-

ANA SANTOS: Yeah, it would be hard for us to address that, but Omid, feel free to email us and we can pass it on to the relevant person to give you more detail on that, if that's okay.

We're two minutes over and I don't see any other questions; if there are any you can drop them in quickly, otherwise I'm sure we can answer them by email as well or, as Chasey mentioned, via any of the LinkedIn profiles; I'm sure our speakers would be happy to answer them. So there's the email address there for you to email us, and our LinkedIn and website. I will post a survey as well in the chat box; if you have some time it would be great to have your feedback about this event. Thank you very much everyone for attending, and thank you to our data experts and speakers for joining today. It was really useful. Thanks everyone.

CHASEY DAVIES-WRIGLEY: Thanks everyone, bye.

CATHERINE QUILLEY: Thank you.

JAMES POULTEN: Thank you everyone for attending, have a great day, bye.
