ALAN: So welcome to this session on balancing breakthroughs and ethical boundaries with data and AI by James Poulten, lead data scientist at Made Tech. James has a PhD in experimental quantum matter physics and has spent six years consulting with public sector organisations. He’s currently the lead data scientist at Made Tech, and he’s worked with a range of big name public sector organisations like Border Force, the Cabinet Office and 10DS, and he’s lent his expertise to other departments such as the Department for Levelling Up, Housing and Communities, Skills for Care and the Met Office. So he’s got a substantial background in working in the public sector and in some private sector situations. And he’s going to talk about some really important issues here about how we look at the ethical boundaries of AI, and how we consider some of the perhaps more negative aspects of what’s happening in the world of AI.
So without any further ado, I’m going to hand over to him. Please, if you have questions put them into the Q&A, if you have comments put them into the comments section, and we will deal with them. Thanks very much.
JAMES POULTEN: Good morning everybody, hopefully everyone can see my screen or the slides at least, and thank you for coming along and attending today.
So, what are we going to talk about today? Ethics and AI: AI is the big hot-button technical topic at the moment and, yeah, there is a real drive to implement it and to consider the ethical implications of doing so. I’m going to be talking to you about what people are saying the concerns are, what the concerns actually are for practitioners like us, and the more serious concerns for organisations and companies implementing AI. I’m going to talk to you about how we stop bad actors in the AI space, and I’m going to try to end on a message of hope, because, as a warning, the first section of this is quite a bleak outlook for AI, even though I am an AI optimist.
So, who am I and why am I talking to you? I’m Dr James Poulten, as Alan said. I have a PhD in experimental quantum matter physics, and I have been consulting in the public sector as a data analyst, data scientist and AI expert for the past six years. I’ve worked extensively with Border Force building reporting systems for them, including a system that helped report on safeguarding against modern slavery, and I built the impostor detection system at the border, used to identify people trying to use fake ID. I’ve worked with Number 10 and with the Cabinet Office on internal reporting and dashboarding tools. I’ve worked with the levelling up group at the Cabinet Office on their white paper. I’ve since moved on to work with Skills for Care, which is an adult social care arm’s length body that focuses on reporting on the state of the adult social care industry; I’m working with DCMS at the moment on a tool for short term lets, and with the Met Office on their user feedback analysis platform. So, six years now consulting in and around the public sector on various data science problems.

So, let’s begin. Concerning AI. This is the narrative, right? Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war. This is the framing we are taught to use when thinking about the ethics of AI, and this is where the conversation always moves when we’re talking about AI and implementing it ethically. In fact there’s a paper that’s been published in which between 37.8% and 51.4% of respondents gave at least a 10% chance of advanced AI leading to outcomes as bad as human extinction. This is the narrative that we’re being told by these big companies: by Google with Gemini, by Meta with Llama, by OpenAI with ChatGPT. The leaders are all getting behind this idea of the killer AI. The AI that is going to wipe out humanity, that is going to take our jobs, the AI that is posing an existential threat to our way of life.
Or if it’s not AI wiping us out, it’s the fact that we need to go faster, we need to develop these tools that are the existential threat to the human race, because if we don’t then our enemies, in air quotes, will. So we need to continue developing these tools. We need to make sure that we are at the forefront of developing this existential threat, this weapon against humanity, because if we don’t develop it then someone else will. There is always this kind of held tension between the risk of it existing and the risk of us not being the ones that bring it into existence.
My opinion, and I believe the opinion of a lot of serious data science practitioners and AI experts, is far more along the lines of: these are the real threats that AI poses. So here we have deepfakes. Now anyone who owns a Google Pixel phone can, within just a few clicks and taps of the screen, create images that depict things like disaster scenes or people misusing substances; they can seamlessly place people into events they weren’t at; they can change the expressions of people in photos to convey different emotions.
This is a technology that’s existed for a long time with Photoshop and, you know, digital editing. Now anyone can do it. Anyone with access to a phone, without any skill, can simply ask an AI model to generate an image, and it’s happening and people aren’t ready for it. We have already had these image creation models used for the generation of sexually explicit material. The big name that stole the media attention was Taylor Swift, who was recently a victim of this: just before the 2024 NFL Super Bowl, sexually explicit images of Taylor Swift were published on X. That’s a terrible story and it should not happen. But the less spoken about issue that’s arising is that students are doing this. School teachers are now faced with a wave of students, so minors, non-adults, with access to this technology, now able to create deepfake, sexually explicit images of their classmates, and Microsoft and the other companies that are building these models aren’t talking about it and aren’t putting the safeguards in place to counter these problems that are arising.
There is a huge question around stolen art: AI image generation, again, is basically plagiarising people’s artwork. There is evidence of this, because artists’ signatures and individual styles are turning up in these AI generated images. People might say this has always been the case, but what these models do is remove the ability to attribute the new work to the artists who have inspired it or whose styles have been taken and exploited.
It’s also happening in the music industry. There are currently several lawsuits ongoing around basically stolen music, with artists wanting to be compensated, because these AI models are now generating revenue that is obviously taking revenue from the artists who produced the original music. I mean, it’s one thing for this to be happening in the private sector; working in the public sector we have to be incredibly careful around this, because the last thing the government wants to be seen as doing is stealing artists’ work and using it as its own.
There’s also a huge amount of bias still in these models, and that’s not going away any time soon. These models are trained on a collection of data from the internet, and the internet has long been in the control of a certain demographic, typically straight, typically white, typically male; and when your data is so heavily weighted with the voice of this demographic, it creates problems for anybody who doesn’t fall into that category. We’re seeing that in hiring, for example: AIs are still biased against people with non-Western names. So it’s an incredibly important issue to tackle, and again something that we cannot let slip through the net when it comes to implementing these models in government systems. We can’t have a racist machine learning model deciding the outcomes of court cases or benefits claims or wherever these AI models are being implemented. We need to make sure that they are fair and not biased, and that the data they have been trained on allows them to make fair decisions.
There’s also a huge issue with worker exploitation. So, at some point in these large language model life cycles, data has been given to a person and that person has had to label that data; there needed to be a human in the loop for the creation of these models. Typically these workers are in the global south. Typically they are paid very poorly, on average I think about $2 a day, and there is also a really upsetting trend of companies paying half up front for this data labelling work and then, once the work is delivered, disappearing and never settling the rest of the bill.
Humans are still at the root of a lot of AI solutions. Even recently we saw Amazon with their Amazon Fresh stores, which they were claiming were AI driven and AI run, but they were actually using workers in India, I think around a thousand people, who were set to watch the cameras in these Amazon Fresh stores, making sure that the systems were working and essentially doing the job of the AI.
Then of course, moving on to the environmental disaster that is waiting for us with the deployment of all of these models: they are hugely energy hungry and are consuming a vast amount of energy. I think at the moment around 20% of Ireland’s energy consumption can be attributed to data centres that are running these models, and that’s only set to explode in the coming years. And, you know, a lot of the use cases don’t necessarily require such a powerful model to sit behind a decision tool or a generation tool. I keep saying it’s like using a Ferrari to do the weekly shop: you’re using a Ferrari of a neural engine to do a fairly trivial prediction task that could be done by a far more efficient algorithm.
So that’s the depressing [laughs] state of the AI landscape, and those are the things that I worry about when I’m asked to implement AI, the things that I think about, the things that I raise with customers and clients when they want to implement some sort of AI solution. Where has the data come from? Is it biased against any demographics? Is it secure, can it be manipulated? Are we using the right tool for the job? And it’s really heartening to know that the government is also thinking about this: at the core of the government’s AI adoption plan are these five fundamental driving pillars.
So: safety, explainability, fairness, accountability and contestability. Is your model safe? Can it be manipulated by bad actors? Is your data source safe? If it’s live data, data that’s constantly evolving, can it be seeded by an external actor or an external force to manipulate the outcome of the model? Explainability: we don’t want to be governed by a series of black box models; it’s important that we understand how we arrive at the decisions that we are making, and it’s important that the people that own those decisions are able to talk about them intelligently. Fairness, which I’ve touched on: it obviously needs to comply with the Equality Act and with GDPR, and it needs to not discriminate or produce unfair commercial or social outcomes. Accountability: a human needs to be in the loop, right? Someone needs to own the decision; we can’t be a society run by machines, by these black box technologies. And contestability: if someone is harmed by a decision that’s made by one of these models, there needs to be a route for them to contest it and face their accuser.
There’s also the generative AI framework, specifically for using the generative AI models that have become so popular in the past 18 months. Again, it’s about understanding the model and its limitations, understanding the use cases for the model rather than just saying, well, you know, everyone’s talking about AI, apparently it’ll be able to solve this problem. You really need to have a true understanding of what generative AI is capable of, what it can do, what it’s good at doing, what it’s not good at doing, and what it shouldn’t be used for.
So, yeah, using the right tool for the job, making sure that it can’t be exploited through prompt engineering or through data manipulation to produce images that, you know, you wouldn’t want attributed to the government, or decisions that are really impactful to people in their day-to-day lives. And all along with this, it’s really important to have the right experts in place. We can’t expect everybody in the civil service to be a machine learning expert, to have a PhD in data science and a deep understanding of the fundamental principles that govern these models. But we do need some people to have that understanding and that awareness, and to be able to advise and guide and lead people, and lead departments, through that process.
So how are we doing this, right? How are we deploying machine learning, how are we deploying AI, to provide impact and really positive change and efficiency in the public sector? I’m going to take you through a couple of case studies now.
So this one: explainable AI with Skills for Care. The problem that Skills for Care were facing is that there is no requirement for privately owned adult social care homes to report to the government how many care workers they have, and knowing how many care workers are in the sector is obviously really important for the government so that they know how to intervene and how to improve that sector. They were using a very long manual process: there was a team of analysts there that were taking several months to collect the data from all of the various sources, and then another several months to work that data through all the analysis and cleaning, and then generating and running the predictive model that they were using.
This was alongside the rest of their day job, so other meetings, other responsibilities, other reports and tasks that they were having to do. It was very manual and a very time consuming process.
We were able to join the Skills for Care analyst team and work as an integrated team, so a combination of Made Tech data scientists, data engineers and Skills for Care analysts, and work with the analysts to upskill them in data engineering and data science. So now they’re able to maintain and run their own fully automated machine learning pipeline. That pipeline is automatically collecting, cleaning and transforming the data into the format that is required. It runs every two weeks, and because they have a far better, far more reliable source of data, they’re able to use a slightly more complicated model, a gradient boosted regression tree model for anyone that’s particularly interested, and that’s able to increase the reliability of their reporting. So they’re now able to report at a far more granular level, a far lower level than they were able to before, which is hugely beneficial for them as a reporting body that is reporting to the government on the state and the wellness of the social care sector. A really excellent project, really great, and we do have a case study published about that.
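To make the modelling step a little more concrete, here is a minimal sketch of the kind of gradient boosted regression tree approach described above, using scikit-learn. The file name, feature columns and parameters are illustrative assumptions, not Skills for Care’s actual pipeline.

```python
# Minimal sketch (illustrative only, not the Skills for Care pipeline):
# estimating care worker counts with a gradient boosted regression tree model.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical dataset: one row per care provider, with a known worker count
# for the providers that do report it.
df = pd.read_csv("care_providers.csv")  # illustrative file name
features = ["beds", "region_code", "service_type_code", "rating_code"]
X, y = df[features], df["reported_worker_count"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
model.fit(X_train, y_train)

# Check accuracy on held-out providers before predicting for non-reporting ones.
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```

In a scheduled pipeline like the one described, a script along these lines would simply be rerun each fortnight on the freshly collected data.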
The second one. So a good portion of my job these days is talking to customers and having people say, we want to use AI to do X, and when they say, we want to use AI, they mean, we want to use a large language model to access our data. And it’s often our job to turn around and say, we understand what you’re asking, but we think that there is a better, more efficient, cheaper, greener, more eco-friendly method to do what you’re asking, to solve your problem, that will deliver the same results.
So with the Met Office: they collect tens of thousands of pieces of user feedback every day, and the English love to talk about the weather, who would have guessed. Until very recently it was the job of, well, at times just one person to take all that data, read through the comments and interpret the user feedback they were receiving to the best of their ability. A maddening task; I’m sure everyone can appreciate reading through tens of thousands, hundreds of thousands of comments every week. So again we worked very closely with the Met Office in integrated teams, upskilling and building an internal capability, and our data science team were able to apply natural language processing techniques, so text cleaning and stemming and lemmatisation, to get to the root of the sentiment in the comments and turn what was largely an anecdotal method of reporting into a far more analytical and quantitative method of reporting, using aggregated statistics to track clusters of user topics or user comments that were focusing on specific issues.
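As a rough illustration of that kind of pipeline (a minimal sketch, not the Met Office system; the example comments, the NLTK stemmer and the cluster count are all assumptions), the cleaning, stemming and clustering steps might look something like this:

```python
# Minimal sketch: clean free-text feedback, reduce words to their stems,
# and group comments into topic clusters whose sizes can be tracked over time.
import re
from nltk.stem import PorterStemmer               # assumption: any stemmer works
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "The rain radar was wrong again this morning",
    "Love the new hourly forecast layout",
    "App crashed when I opened the radar map",
]

stemmer = PorterStemmer()

def clean(text: str) -> str:
    # Lower-case, strip punctuation, and stem each word.
    words = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(w) for w in words)

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(clean(c) for c in comments)

# Assign each comment to a cluster; counts per cluster become the
# aggregated statistics used for quantitative reporting.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(list(zip(labels, comments)))
```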
We weren’t using a large language model, despite, I think, a fair amount of pressure to do so at times, because AI models, or at least language models, are seen as the hot toy to play with. That would have been incredibly expensive, and we don’t want to waste government, public money just for the story. It was also simply unnecessary, and it doesn’t help that these models are incredibly energy inefficient, so it would also have been burning small rainforests just to do a simple query. No, we instead used natural language techniques that have been around for a long, long time and are well established, and built, again, an automated pipeline that fed into dashboards. We are also integrating a natural language front end now: now that the analytical work has been demonstrated and the true value exposed, we are using a language model to increase the democratisation of the data, allowing people with zero code to ask questions of that data. But again it’s a considered step: it’s working out what the most efficient way to do that is, working out what the most eco-friendly way to do it is, making sure that the pipeline is safe and can’t be manipulated, and making sure that the people that are using these technologies understand what they’re doing and the impact behind them.
Yeah, again that is a case study that we’ve written up and is available to read, and I can circulate it at the end of this. And with that, I think that just about brings me to the end of my talk. Here are the contact details if you’d like to reach out and chat. This is going to be shared, so don’t feel you need to take a screenshot instantly, but if you wanted to read any of the articles, I’m an academic so, you know, it’s important to cite all of the references that I have used today. As I say, this PowerPoint will be shared at some point, I think with the recording, so they’ll be in there as well. But if you wanted a screenshot then now’s your chance. Otherwise, I think that brings me to the end of the talk, so I’ll take some questions. I’ll hand back over to Alan.
ALAN: Thank you, thank you James. I think you’ve raised some really important topics here about both the dark side if you like, of AI and some of the concerns that many people share, and then some of the ways which you’re looking at dealing with them. So maybe I can pick up a couple of questions that are in the Q&A, and while I’m doing that if you have other questions, please pop them in the Q&A.
The first question is a little bit about bias, and in particular about data collection. The comment is about how we can improve data collection to teach AIs ethics, and in particular it’s looking at the fact that if we can include a more diverse workforce in developing the AI models to account for bias, maybe we first need to acknowledge the bias to be able to diminish it. So maybe you can make a few comments there, James, about how you see that?
JAMES POULTEN: Absolutely, so a hundred per cent I think we should have a more diverse workforce when it comes to developing these models; it shouldn’t just be left to tech pros in San Francisco. I think there should be a wide-ranging workforce with a whole variety of backgrounds and mixed experiences, and I think only through doing that can we really address all of these topics, because the more opinions there are in the room, the more chance we have of highlighting issues that need to be addressed.
As for the issue that these language models are facing: they require such a vast quantity of data that the only source really is the open internet. So they’re now running into a problem, OpenAI and ChatGPT; they’re currently running into the issue that they’ve run out of data to really significantly improve their model, and they’re looking for ways to get around that. That’s why Reddit has now become such a valuable company, because they have a huge amount of data being created every day, real data, not data created by another AI somewhere.
So when you’re talking about a small model, absolutely, it’s important to work with a blended team. I always advocate for a user researcher to be working with a data science team, or a data team, to understand all of, or as many of, the clusters and segments of the user base as possible, to try and highlight and make sure that all demographics are represented fairly and equally. But when it comes to how we address this fundamental flaw in these large, you know, foundation models, they’re called foundation models now, these huge models that take petabytes of data and months and hundreds of millions of dollars to train, I don’t know how we combat that. I think the only real way is to put safeguards in with prompt engineering and to have an understanding of their limitations, so that we can look to address those as we go.
ALAN: Thank you, and I think the next question sort of leads on from that. James, you were starting to say that it’s really the big tech companies that are using a lot of this data from across the internet, and some of the implications of that.
JAMES POULTEN: Mhmm hm.
ALAN: There’s a question that says, well, what’s our role, and our organisation’s role, when really it’s these big, large tech companies that are at the helm of a lot of these issues? What can we do, and what’s the responsibility at an organisational level, particularly around things like the environmental issues, because there are several questions about the environmental impacts of these things. Some of it’s in our control, but surely a lot of it’s out of our control and in the hands of some of these big players?
JAMES POULTEN: Absolutely. So the best thing that you can do to take back control is to ask: do you really need to use one of these huge models, like ChatGPT or Gemini, to do what you’re trying to do? Or can you use an open-source model? I have several that run on my laptop; I can disconnect from the internet and they run entirely on my computer. You can use an open-source model that is a fraction of the size of these models and takes a fraction of the power and energy consumption, and that is just as performant at doing simple summarisation tasks, or translation tasks, or routing through logic. You know, they’re incredibly powerful models still, but they don’t have that huge impact.
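As a small illustration of what running locally can look like in practice (a minimal sketch under assumptions: the Hugging Face transformers library and the particular model named below are just one possible choice, not necessarily what the speaker uses), a simple summarisation task on a laptop might look like this:

```python
# Minimal sketch: a small open-source summarisation model running entirely
# locally, with no data sent to a third-party API once the model is downloaded.
from transformers import pipeline

# Model choice is illustrative; any small locally cached summarisation model works.
summarise = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

text = (
    "Open-source models that fit on a laptop can handle simple summarisation, "
    "translation and routing tasks without the energy cost of a frontier model."
)
print(summarise(text, max_length=40, min_length=10, do_sample=False)[0]["summary_text"])
```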
Another part of it is understanding that not all AI is large language models; there are different types of models that we can use. So with Skills for Care, that wasn’t using a large language model, that was using a regression model. That was, again, a small model that can be run locally, and it was designed specifically to address the business problem, rather than these massive LLMs that are now being used as a substitute, just, you know, take my data, see what it can do, see if it can provide an answer. Whereas what I would actually advocate for is: give your data to a data scientist and let them work on the smallest possible solution that provides the same, if not better, answers, because it’s tailor-made to your problem and to your business use case. And these smaller models have been around for a long time. They’re not burning rainforests or using vast quantities of water for each query.
There’s also an interesting point: these large language models, the LLMs, the ChatGPTs, are trained by private companies that are so large there isn’t a government body in the world that can really compete with them, right? They are entirely privately owned entities, owned by private companies and private shareholders. But there is an open source community that is producing some incredible models that you can leverage, that you can train against your data for your use case in a more bespoke manner, which you would then own and have control over, and which isn’t subject to one of these companies being sued into non-existence or going bust.
ALAN: So I think the next questions are dealing a little bit with those same issues. Part of your answer is: we need to raise awareness, we need to raise awareness about what’s happening and how to use these tools more efficiently and effectively, and even perhaps a little bit of knowledge about where you need to advocate for new things, where you need to be looking to government to create some regulations. How do we increase understanding and raise the profile of these issues, and increase the education and awareness more generally?
JAMES POULTEN: I think the last 18 months have been really interesting on this, because as I’ve said, I’ve been a data scientist for six years, and before that I was from a very academic background, so I was aware of these processes and these techniques and used them in my research. But until 18 months ago, maybe a year ago, and particularly until the last year, people didn’t really understand what AI was or what a data scientist was. So there’s been such an explosion in understanding already, and that’s really heartening to see.
But it is a double-edged sword, because now everybody hears AI and assumes ChatGPT, where actually AI is a host of techniques and abilities and processes that can be used to answer a huge array of different business problems. So, yeah, the huge awareness that has happened so quickly means that people are talking to me, which is nice, but it has also sucked the oxygen out of the conversation a little bit, because everyone just wants to talk about, can we talk to ChatGPT.
So that is something that we are currently trying to answer, and doing sessions like this and talking about where we aren’t using LLMs but we are using AI, be that regression or clustering or classification or, you know, one of these other data science paradigms, is really important, even if sometimes you do have to do the old bait and switch, where the AI talk turns into, well actually this is hardcore data science, let me show you some maths.
ALAN: Yeah, great, well that’s really helpful and I hope people have taken something away. But maybe just to finish up, is there something you’d like to say as a sort of final message from the talk, or a piece of advice that you’d like people to take away from this, that might be a useful way of wrapping up?
JAMES POULTEN: Absolutely, well, a couple of things. First of all, it’s difficult now to buy a digital product that doesn’t have AI bolted on to it somewhere. Do not let them get away with that; read the terms of service, because at the moment, nine times out of ten, these AI add-ons and features mean that your data is being collected, meaning it’s being sent to a data server, typically at the moment in the US, and is being used to train models. So if you have a recruitment service, or if you have something that is summarising your documents, unless you’re explicitly opting out of these kinds of bolt-on AI features, which I would say are gimmicks, they’re probably taking your data, sending it to the US and using it to train models. So be very wary, especially of that.
The other takeaway is that data science is an incredibly powerful tool when it comes to freeing up analysts’ time and letting them do more of their work. It’s really good at improving productivity, and it’s fantastic at driving down costs, because you can save hundreds of hours a month by automating some of the boring, repetitive, monotonous tasks, and you increase accuracy because you’re taking that monotonous task off the human that is not interested in doing it and giving it to a machine, which doesn’t care what it does. So there is so much potential in data science and AI that it’s so exciting to be an expert in the field and finally have people coming and asking me, you know, what can we do, how can we use this?
I would say just come to those conversations with the problem, not with the solution. Come and say, we have this problem, can you help us solve it, and I would say nine times out of ten the solution is not going to be a language model, and subsequently it’s going to be a lot cheaper and a lot easier to do than you were probably thinking.
ALAN: Great, I think those are really strong messages for the audience about awareness and about coming to it problem first rather than solution first. So, I’d like to thank you James for the presentation, thank everybody who’s been involved in being a participant and to remind you that these sessions will be made available online, and please take a look, and thank you again for being part of this.
Thank you James, and thank you to everyone.
JAMES POULTEN: Thank you everybody.