Stefan Krawczyk - The Power of ML Dataflows using Hamilton

Conversations on Applied AI

Welcome to the Conversations on Applied AI Podcast where Justin Grammens and the team at Emerging Technologies North talk with experts in the fields of Artificial Intelligence and Deep Learning. In each episode, we cut through the hype and dive into how these technologies are being applied to real-world problems today. We hope that you find this episode educational and applicable to your industry and connect with us to learn more about our organization at AppliedAI.MN. Enjoy!

November 08, 2022 • Justin Grammens • Season 2 • Episode 28

The conversation this week is with Stefan Krawczyk. Stefan is a data-focused leader and polymath engineer. His interest has spanned design, implementation, integration, and peer education of data-related systems. Most recently, his focus has been on featurization and model-serving systems and platforms. He has a Bachelor's in Computer Science and Mathematics from Victoria University of Wellington, and a master's in computer science from Stanford University specializing in artificial intelligence.

If you are interested in learning about how AI is being applied across multiple industries, be sure to join us at a future AppliedAI Monthly meetup and help support us so we can host future Emerging Technologies North non-profit events!

Enjoy!

Your host,
Justin Grammens

Stefan Krawczyk 0:00

There are a lot of Python frameworks that help you schedule things and scale things just by decorating a function. With Hamilton, you don't have to decorate all your functions; we can just do that at runtime for you. So I'm excited for this ability to hide the infrastructure details but then give people the ability to pick and choose things depending on their context. The feature engineering space, and the data engineering space in general, I think has some interesting problems to be solved, in which case I think Hamilton presents an interesting solution, where it's pretty opinionated, not too esoteric, but you can still do everything that you want in Python. It just makes this problem of managing the central, you could say, script to create an object go away, and instead the framework just takes care of it.

AI Announcer 0:47

Welcome to the Conversations on Applied AI podcast, where Justin Grammens and the team at Emerging Technologies North talk with experts in the fields of artificial intelligence and deep learning. In each episode, we cut through the hype and dive into how these technologies are being applied to real-world problems today. We hope that you find this episode educational and applicable to your industry and connect with us to learn more about our organization at AppliedAI.MN. Enjoy.

Justin Grammens 1:18

Welcome, everyone, to the Conversations on Applied AI podcast. Today we're talking with Stefan Krawczyk. Stefan is a data-focused leader and polymath engineer. His interest has spanned design, implementation, integration, and peer education of data-related systems. Most recently, his focus has been on featurization and model-serving systems and platforms. He has a Bachelor's in Computer Science and Mathematics from Victoria University of Wellington, and a master's in computer science from Stanford University specializing in artificial intelligence. Thank you for being on the podcast today, Stefan.

Stefan Krawczyk 1:47

Thanks, Justin. Thanks for having me.

Justin Grammens 1:49

Awesome. I just love talking with people who are doing fascinating stuff in the areas of artificial intelligence, since the podcast and the meetup group are really focused around applied AI. So I'm really excited to talk about some of the projects that you've been working on and where you're headed in your career. But maybe you can bring the listeners up to speed. I mentioned you graduating from university and stuff; how did you get to where you are today? What were some of the companies you worked at? Maybe talk about the trajectory of your career.

Stefan Krawczyk 2:15

So I did my undergrad in New Zealand; I grew up in Wellington. Silicon Valley was where it's at for doing anything computer science, in which case I got an opportunity to do an internship with IBM in San Jose. So I did that. And then my internship was coming to an end, and I was like, what do I do now? Go to grad school. So I applied to grad school in California, got into Stanford, and did my master's in computer science there. It was right before all the deep learning stuff came out, so, you know, I'm still behind on that coursework, but the PhDs at Stanford were coming up with this stuff at that time. I did an internship at Honda Research; I was trying to figure out, do I want to do a PhD, or do I want to do more research stuff? And so I built a prototype of a spoken dialogue system there, which was kind of interesting. The idea was that I was building a little prototype that researchers could then use for their work, and then, obviously, those ideas eventually make their way into the car. But obviously that takes, you know, years to come through, given the nature of how that research arm of Honda operates. And so I was like, no, I don't want to do that. So after graduating from Stanford, I joined LinkedIn. It was right at the time of the IPO; I guess I was lucky that I landed there, because I saw tremendous growth and what it means to be a hyper-growth company. Initially I was doing back-end address book work and boring infrastructure. I was on a growth team to try to grow LinkedIn, and it was interesting to see how to be metrics focused; our product manager was very much focused on metrics.
So it was very easy then to see that thinking, and therefore, if I had an idea that could move metrics, I could just go forward and implement it. And, you know, he'd be happy.

Justin Grammens 3:43

What was the timeframe? I guess, what were the years we're talking about here?

Stefan Krawczyk 3:47

2010 to 2012. But, you know, I did this AI specialization and I wasn't doing any machine learning. So I switched teams for a bit to prototype content-based recommendation products. That was a good learning experience in using all the tools and seeing what problems arise: there are all these data sets, and you don't know if you can trust them, what you have to do to clean them, et cetera. And then, beyond that, what is the iteration cycle? Getting a model out to production was actually pretty hard. And so then I guess I wanted to go someplace smaller. So I went to Nextdoor and was engineer number 13 there. I got to build a lot of first versions of things: email analytics infrastructure, data warehousing infrastructure, A/B testing and experimentation infrastructure, a lot of zero-to-one and one-to-two versions of things. That was a good lesson in how to get something off the ground. And while I was there, it was interesting to see how people talked about data; data was in vogue, and "how do you become data driven" was right around that time. Building an experimentation system definitely helped with that. If you had an idea, you could iterate on it, put it behind some sort of feature flag, roll it out, test it, et cetera, and see whether, in terms of metrics, it worked or didn't. So I was pretty passionate about that aspect of product work. And then I wanted to focus a bit more on machine learning infrastructure, because Nextdoor wasn't at the point where that was super critical to what the company was doing. I went to an NLP-for-enterprise company; some PhDs out of Stanford wanted to create a company.
So I knew them from Stanford. An exciting thing at the time was, how do you build a machine learning model and then put it behind an API? Again, that was actually pretty challenging at that time. For instance, back in the day you could build a Spark model, but you could only use the model on Spark. So if you wanted to have a web service that sent you a request for a classification, you had to figure out your own way of doing it. And so that's where I got into the guts of machine learning infrastructure: how do you think about what the model is? How do you take it from having trained it in one piece of infrastructure to serving API requests with it? And then I went to Stitch Fix, where I've been for the last six years, just recently transitioning away from it. Engineering for data science was very attractive there, because I grew to enjoy building the infrastructure more than doing the modeling. I was less excited by getting a data set and figuring out what to do with it, and more so by helping someone make the best use of the data set without having to do a lot of work to do so. So at Stitch Fix I got to focus on a wide swath of problems, from deployment and experimentation to training and inference. And hence some of the tooling, I guess the discussion points today, will be around the stuff that I've done at Stitch Fix.

Justin Grammens 6:33

Wow, fascinating career, man. You've worked at some really, really interesting companies and gotten in early, it seems like, with a lot of them. Just to think back: Nextdoor was probably very small at that time, but how were they going to use all of this data around neighborhoods? Thinking through how you bring that to life, I think, is really fascinating. And then, with this NLP enterprise company, I was thinking about TensorFlow Serving, right? That's one of the things that exists now, and a lot of companies are building stuff on it, but at the time you were on the early, early side of this. I know we'll talk a little bit about Hamilton and the micro framework, but just having a model and building something out of a Jupyter Notebook is totally different than actually deploying it. And it seems like, as you said, engineering for data science is where you've moved yourself more and more these days. What are some of the current challenges, I guess, that companies aren't thinking about, or that you're seeing them face? It is a challenge, I guess, for companies to really deploy these models into production.

Stefan Krawczyk 7:38

Part of the challenge is actually the environment that you operate in within a company. The companies that I've looked at have been around; they've been engineering focused, in which case machine learning and those aspects of things came from an engineering discipline or background. People who were doing machine learning five years ago built their own infrastructure, basically. But now, if you're starting your own startup, your own company, there are so many tools available to you that probably the hardest part is actually choosing one. The problem used to be that it was difficult to go from JupyterHub to having an API service; that is, I want to say, pretty easy these days. And so now I think part of the decision making or difficulty is: if you're a data scientist, who's supporting you? Do you have an engineering team? Do you have an IT team? Can you deploy stuff yourself or not? So there's a bit of figuring out what you can do yourself, I think, at companies at least outside of Silicon Valley. But otherwise, in terms of the last six months to a year, I think the trend has been: okay, it's been easy to deploy models, now I can get my models to production. Okay, now there are bugs and data issues; how do we figure out what's going on? To me, it seems analogous to how microservice architectures came about. It used to be really hard to deploy a service. Now it's really easy, and everyone does microservice architectures. But then, oh look, there are all these other problems that crop up with these kinds of architectures. So I want to say with modeling it's similar. I think the analogy holds in that it was difficult.
Now it's easy to deploy, but now there are problems of what happens in production. How do I know that this model is healthy? Maybe it's the data, maybe you're just getting poor predictions, or maybe the training data somehow changed and now you're not getting the results you expect. So I think that's where right now there's a bit more of a focus and movement in general in the MLOps industry: now that deployment is solved, or at least much easier than before, people are probably spending their time actually figuring out what's going on in production. And so there's a lot of tooling and startups cropping up around that.

Justin Grammens 9:37

Gotcha. So yeah, it's really around I guess, I would just say, validating that the predictions you're getting are what you're expecting and maybe a lot of logging and just visibility, I guess in general, right.

Stefan Krawczyk 9:47

Yeah, I mean, you can log stuff, but then it's about trying to automate the analysis and coming up with tooling where you just connect the logs and then, magically, you get some sort of indicator, via some summary statistics, like, hey look, there's model drift, or hey, input features are matching or not matching. So basically, rather than people doing bespoke things as they did before, being able to pull something off the shelf, and even pay for it, rather than building it yourself.
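
To make the "summary statistics as an indicator" idea concrete, here is a minimal sketch of one very simple drift check: compare the mean of a feature in production against its mean in training, measured in units of the training standard deviation. This is only an illustration and not a tool mentioned in the episode; the function names `mean_shift` and `has_drifted` and the threshold of 2.0 are assumptions chosen for the example, and real monitoring tools use more robust statistics.

```python
from statistics import mean, stdev


def mean_shift(training, production):
    """Shift of the production mean, in units of training standard deviation."""
    spread = stdev(training)
    if spread == 0:
        raise ValueError("training values have zero variance")
    return abs(mean(production) - mean(training)) / spread


def has_drifted(training, production, threshold=2.0):
    """Flag a feature whose production mean moved more than `threshold` deviations.

    The threshold is an arbitrary illustrative choice, not a recommendation.
    """
    return mean_shift(training, production) > threshold


# Hypothetical feature values logged at training time and in production.
training_values = [10, 12, 11, 13, 9]
similar_batch = [11, 12, 10]
shifted_batch = [20, 21, 19]

print(has_drifted(training_values, similar_batch))
print(has_drifted(training_values, shifted_batch))
```

A check like this is the kind of thing that would run automatically over logged inputs, so nobody has to eyeball the logs by hand.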

Justin Grammens 10:20

Yeah. So as I alluded to before, maybe we can talk a little bit about Hamilton now. This is an open source Python micro framework, and I'll let you explain a little bit about how it gets used and how it fits into the entire, I guess, ecosystem. But the fact that it's open source, I think, is phenomenal. It's really cool. Were you driven by that? I mean, have you contributed to other open source projects? Do you find yourself being very much interested in open source in general?

Stefan Krawczyk 10:46

Yeah, I mean, especially if you're in any data-related space, most of the tooling you're probably using, especially if you've been doing it for years, is some sort of open source tooling. PyTorch, TensorFlow, they're all open source products. And then the infrastructure runs on top of Docker or Kubernetes and things, right? They're all open source. So I think if you've been doing this for a while, most of your stuff is open source. I've fixed bugs here and there in various Python libraries, because I would say most of the libraries that we use under the hood are open source; even if you use AWS, you're using AWS's open source Python libraries, right? In terms of the impetus for Hamilton, I was seeing all these similar tools and frameworks come out, and I was like, hey, we've had this for a few years now, and this actually looks pretty easy to open source. I think one of the hard parts for companies with open sourcing is that you end up coupling a lot of your company's concerns with things. With Hamilton, there actually weren't that many concerns coupled in from the standpoint of making the code open source, so it was a pretty easy lift. In which case I was like, yeah, sure, let's open source it and see the reaction.

Justin Grammens 12:00

Yeah, perfect. Perfect. Tell us a little bit here, then, about the Hamilton project, how it got started, and what it does.

Stefan Krawczyk 12:07

I'm still trying to figure out what is the most succinct way to describe it; it's a bit of a Swiss Army knife, hence the difficulty. But essentially, I want to call it a micro framework for creating dataflows from Python functions, where the Python functions are written in a declarative manner. What I mean by micro framework is that it's not an orchestration system and it doesn't replace your infrastructure; it actually helps you model a single step inside your workflow. I'll get to the origin story in a little bit, but you can run it anywhere that you can run Python. And that's why it's a micro framework: it forces you to write these functions in this declarative manner. At Stitch Fix, we created it for a team that was trying to manage a large time series feature engineering codebase. They were creating thousands of features, and this codebase was becoming pretty unmanageable for them. It was mainly time series forecasting. For those who don't know, in most of the feature engineering that you're doing, when you're creating new features, they're generally some derivative of other features. So with time series, you might have one feature, and then you're doing lags, you're shifting the time around, you're doing various transforms on the time series columns as inputs to create new features. And so you end up building this chain of dependencies between features. If you're writing a pandas script where you're manipulating a single data frame and adding all these columns as new features, with five or 10 that script is pretty manageable.
But if you have thousands of columns in a single script, that codebase isn't very nice to touch, because you lose things like unit testability. Sure, you can put things into functions, but there are many ways to write those functions. Documentation is difficult, because if you're doing a lot of inline pandas column creation, there isn't really any place to put documentation. And then the order of the script matters. So if you're onboarding and you're tasked with changing something or adding something, how do you know you're not going to inadvertently change something else? It's really hard to grok. With that team, because it was one of the oldest teams at Stitch Fix, their codebase was old, in which case the most tenured people were the most productive, and it took new hires quite a while to ramp up. So we, as an engineering for data science team, were tasked with: can we reframe this problem a little bit to get at their pain points? The project was started with the goals of: can we make everything testable? Can we make sure documentation is easy? And can we improve their workflow for adding new features? Because this team had to provide operational forecasts to the business, they were always going to be adding new features on a monthly basis, because, I don't know, there was a new marketing campaign or some new experiment that changed some numbers. So they needed to go add or change some features, because they were really trying to model the business. Before Hamilton, they had a monthly task that would take them about a day.
After migrating to Hamilton and using the micro framework, that task now takes them less than two hours. It wasn't from new technology in terms of speeding up computation or anything like that; it was just changing the way that they write code. The core of Hamilton is that it forces this way for you to think of, describe, and write features. Just to give you a bit of a sense of what the code actually looks like and what functions you're actually writing: picture a pandas script where you have a data frame object and you're just creating columns. In Hamilton, you would create a function where the name of the function is the name of the output column that you're trying to create, and the input arguments to the function are the input columns that the function requires to perform the transformation or computation. The body of the function becomes your logic. You can use docstrings, because it's a Python function. And because of the way that the function is written, it's very easy to unit test, because you're not leaking where the data comes from; you're just writing this function where the name of the function is the output and the arguments are the inputs. Once you build up these functions, we then stitch everything together: we do some computer science 101 and build a directed acyclic graph. We stitch everything together by name. So if a function needs an input called date, we will either look for a function called date or expect date to be passed in as an input. What this paradigm really changes is that you're writing your logic independent of how you're going to call it and how you're going to create it. So you write things in two phases, or there are two steps.
You write these Hamilton functions, and then you write a little driver script, which says: what is the DAG, or directed acyclic graph, that I want to build, and what are the outputs that I want? And then the framework will walk the DAG to compute only the things that are needed and provide you the result, assuming you provide the right inputs. So that's Hamilton, roughly, at a high level. But yeah, it's open source. Check us out, of course.
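
The two phases Stefan describes (functions whose names are outputs and whose parameters are inputs, then a driver that walks the graph) can be sketched in plain Python. This is a toy resolver written to illustrate the idea, not Hamilton's actual implementation or API: the `execute` helper is hypothetical, plain lists stand in for pandas columns, and the feature functions are made-up examples loosely modeled on the spend example mentioned later in the conversation.

```python
import inspect


# Phase 1: declarative functions. The function name is the output column;
# the parameter names are the columns (or other functions) it depends on.
def avg_3wk_spend(spend):
    """Rolling three-week average of spend; None until three weeks exist."""
    return [
        None if i < 2 else sum(spend[i - 2 : i + 1]) / 3
        for i in range(len(spend))
    ]


def spend_per_signup(spend, signups):
    """Spend divided by signups, week by week."""
    return [s / n for s, n in zip(spend, signups)]


# Phase 2: a toy "driver" (hypothetical helper, not the Hamilton API).
def execute(functions, outputs, inputs):
    """Compute only the requested outputs, wiring dependencies by name."""
    registry = {f.__name__: f for f in functions}
    computed = dict(inputs)  # values supplied by the caller
    in_progress = set()      # names being resolved, used to detect cycles
    ran = []                 # order in which functions actually executed

    def resolve(name):
        if name in computed:
            return computed[name]
        if name not in registry:
            raise KeyError(f"no input or function named {name!r}")
        if name in in_progress:
            raise ValueError(f"cycle detected at {name!r}")
        in_progress.add(name)
        func = registry[name]
        kwargs = {p: resolve(p) for p in inspect.signature(func).parameters}
        computed[name] = func(**kwargs)
        in_progress.discard(name)
        ran.append(name)
        return computed[name]

    return {name: resolve(name) for name in outputs}, ran


weekly = {"spend": [10, 20, 30, 40], "signups": [1, 2, 3, 4]}
result, ran = execute(
    [avg_3wk_spend, spend_per_signup], outputs=["avg_3wk_spend"], inputs=weekly
)
print(result["avg_3wk_spend"])  # [None, None, 20.0, 30.0]
print(ran)                      # ['avg_3wk_spend']
```

Only `avg_3wk_spend` runs here, because `spend_per_signup` was not requested and nothing requested depends on it. Each feature function can also be unit tested on its own by passing in a small hand-written list, since it knows nothing about where its inputs come from.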

Justin Grammens 17:07

Yeah, no, it's phenomenal. This is awesome. We have liner notes for the podcast, and I will absolutely put a bunch of links in; it's just github.com/stitchfix/hamilton, I think, is what I've been looking at. And you walk through it in your README: it's super simple for people to install, just a pip install, and you walk through a first hello world example where it's like, hey, you have these columns of things, but maybe you want to compute, in this case I see, an average three-week spend, for example. And so it feels like it's auto-documenting, right? It kind of forces you to write the documentation along the way. Is that true?

Stefan Krawczyk 17:43

Yeah, I mean, at Stitch Fix the team ended up settling on a naming convention, because rather than having some variable named foo, Hamilton really forces you to name the functions, along with the inputs, to be meaningful. Then when you come to read it, you can go: oh, average three-week spend is the name of the function, and what's the input? Oh, it takes spend as input, and then maybe some other parameters. It's pretty clear for you to read, or at least try to understand. Then when you read the body of the logic: oh look, it's doing an average, and oh look, there's a number three; okay, I understand what's going on. This was, you could say, gut instinct; we didn't know how well it would turn out. But essentially it forces better naming, and then the code is a little bit more readable. At Stitch Fix, onboarding for the team became much easier, much simpler for someone to get started with. One of the other nice things about focusing on the naming is that if you have some output and you want to understand what created it, you just grep the code, or Command-F, for that name, and it's pretty easy to find the definition. Or, hey, I wonder what uses this: it's very easy to figure out the dependencies, where something is also consumed. So updating the code, maintaining it, and finding things in it is much simpler, because everything is forced to be broken up into functions.

Justin Grammens 19:02

Yeah, it feels very much like behavior-driven development, I guess, in some ways, where you do a lot of explanation. Now, does that help the data scientists, I guess, the people that are on the data science team?

Stefan Krawczyk 19:13

Yeah, I mean, the codebase now looks pretty uniform. I think that's one of the things that you don't really learn about when you're writing code until you've written enough of it over time and have had to come back to your own code. Everything's done in this very structured way. Because they're Python functions, you can curate them into modules, so you can thematically have all the date features in dates.py and all the marketing features in marketing.py. It helps organize and structure things. Obviously, maybe there's a little bit of friction to get started, because you're not writing a one-line pandas statement; you are writing a function, and you have to figure out which module it should live in. But longer term, that means for anyone coming back to the code, maintaining it is just much, much simpler. We can also visualize the execution: given the graph nature of the computation, we can output a visual picture of what's going to happen. It also helps the data scientists from the iteration perspective when they have to come and add things. Hamilton will tell you if there are two functions doing the same thing that you're trying to build a DAG from, in which case it's very hard to step on someone else's toes. If you want to create a new feature, that's just a new function, and it's independent as long as it's named something different. And if you want to update or change how something is computed, you then have to add a new dependency in that function.
So then it's very clear to someone who's reviewing the code what the impacts are and what's going to happen. It just helps them speed up the general maintenance activities and manage the codebase without fighting it.
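
The point about two functions producing the same output can be illustrated with a small sketch. It assumes, hypothetically, that functions are collected from several modules into one registry keyed by name and that a clash is rejected up front; the `collect_functions` helper and the example features are invented for illustration and are not Hamilton's API. `SimpleNamespace` objects stand in for the `dates.py` and `marketing.py` modules mentioned above so the example stays in one file.

```python
from types import SimpleNamespace


def collect_functions(modules):
    """Build a name -> function registry, rejecting duplicate output names.

    `modules` maps a label such as "dates.py" to a namespace of functions.
    """
    registry = {}
    owner = {}  # remembers which module first defined each name
    for label, module in modules.items():
        for name, obj in vars(module).items():
            if name.startswith("_") or not callable(obj):
                continue
            if name in registry:
                raise ValueError(
                    f"{name!r} is defined in both {owner[name]} and {label}"
                )
            registry[name] = obj
            owner[name] = label
    return registry


def is_weekend(day_of_week):
    """True for Saturday (5) and Sunday (6)."""
    return [d >= 5 for d in day_of_week]


def spend_per_signup(spend, signups):
    """Spend divided by signups, week by week."""
    return [s / n for s, n in zip(spend, signups)]


# Stand-ins for two thematically organized modules.
dates = SimpleNamespace(is_weekend=is_weekend)
marketing = SimpleNamespace(spend_per_signup=spend_per_signup)

registry = collect_functions({"dates.py": dates, "marketing.py": marketing})
print(sorted(registry))  # ['is_weekend', 'spend_per_signup']
```

If a second module also defined `is_weekend`, `collect_functions` would raise a `ValueError` naming both modules, which is the "hard to step on someone else's toes" property in miniature.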

Justin Grammens 20:49

That is really cool. So, to close the loop for us: at the end of all of this, you end up getting a data frame that is a combination of all these features that you added, right?

Stefan Krawczyk 21:01

Hamilton works with any Python data type, but if you're modeling any kind of tabular thing, Hamilton works really, really well. Especially for pandas, it does get you to think about things in a columnar fashion. One of the things most people, at least as I remember going back, are used to is for loops and doing everything in a row-based processing fashion. But Hamilton forces you to think more in a columnar fashion. You can do things row-based, but the upside is that most of your computation can also be quicker, because vectorized computation is faster when you're working over columns. At least at Stitch Fix, it was created to manage a pandas data frame, in which case, yeah, you can just get a pandas data frame. But if you wanted to build a scikit-learn model, you can also model that kind of ETL process with Hamilton as well.

Justin Grammens 21:50

Yeah, awesome. It looks like it's in active development; I see people committing stuff just four days ago, 10 days ago, all that type of stuff. What's the future? Right now you're looking at what you're going to do next in your career; what do you see for this project?

Stefan Krawczyk 22:09

Yeah, I mean, so Hamilton, I think is it's interesting from the perspective that it's trying to like to leak too many kind of underlying computational kind of platform infrastructure kind of concerns. So that's one of the things that stitcher because like, yeah, we wanted to build an API where the data scientists could do their work. But without really having without leaking, you know, that's running on Spark or where it's kind of running. Because ideally, we can, as a platform, then, you know, change things without people having to migrate. Because that was one of the taboo words at Stitch Fix for data scientists, migration events, or, you know, being able to build trying to build an API where you didn't have to migrate, or at least the migration, paying if there was some to bear was like, pretty minimal. So I pretty kind of excited where we're at. I think by the time this podcast drops, we'll hopefully have a data quality kind of feature. So one of the nice things with with having functions is like the idea is that well, wait, why can't we just add a decorator. So this is like an annotation. So when you say a Python, you're to add something above a function, that's what I'm kind of describing as a decorator, wouldn't it be cool to also have a runtime expectation set there. So if you have your computing, you know, the age kind of feature or something, you can then say, hey, this isn't this should be above zero, there should be between zero and 120, or something. And that will be in the code. So then the the maintenance of the code, and it's like very easy to kind of see when you if you would look up the logic or understand what's going on, like the test is kind of described with the code. So I'm kind of excited for that. 
And then there's a lot of Python frameworks that help you to schedule things and scale things just by decorating a function, in which case, with Hamilton, you don't have to decorate all your functions; we can just do that at runtime for you. So we have an integration with Ray and Dask, where it's very easy to scale the computation with Hamilton without you having to rewrite your code or logic, or even know up front that you want to scale onto those systems. You just have to change what we call the driver code, so the thing that tells Hamilton what DAG to build, and then you just say, hey, I want this to be computed on Ray or Dask. And then if you happen to be using Pandas, we can also delegate to pandas on Spark, which came out in Spark 3.2, so that you can write your functions and then they'll just run on Spark as well. So I'm kind of excited for this ability to hide the infrastructure details, but then, you know, give people the ability to pick and choose things depending on their context. And so that's roughly where, hopefully, I'll be spending my time. But otherwise, yeah, I mean, the feature engineering space, and in general the data engineering space, I think has some interesting problems to be solved, in which case, I think Hamilton presents an interesting solution where it's pretty opinionated, not pretty esoteric, but you can still do everything that you want in Python. It just makes this problem of managing the central, you could say, script to create an object go away, and instead the framework just takes care of it.
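To make the "driver tells the framework what DAG to build" idea concrete, here is a toy sketch in plain Python. It is not Hamilton's implementation or API: `execute`, the `run` hook, and the example functions are hypothetical names chosen for illustration. The assumption is that each function is a node and its parameter names are its dependencies, and that the execution backend can be swapped without touching the functions.

```python
import inspect


def signups(raw_signups):
    return [float(x) for x in raw_signups]


def spend(raw_spend):
    return [float(x) for x in raw_spend]


def spend_per_signup(spend, signups):
    return [s / n for s, n in zip(spend, signups)]


def execute(functions, inputs, wanted, run=lambda fn, kwargs: fn(**kwargs)):
    """Toy driver: resolve `wanted` outputs by treating parameter names as edges.

    `run` is the swappable backend; the default just calls the function locally.
    """
    nodes = {fn.__name__: fn for fn in functions}
    cache = dict(inputs)  # raw inputs are already "computed"

    def resolve(name):
        if name not in cache:
            fn = nodes[name]
            # Recursively compute each dependency named in the signature.
            kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
            cache[name] = run(fn, kwargs)
        return cache[name]

    return {name: resolve(name) for name in wanted}


result = execute(
    [signups, spend, spend_per_signup],
    inputs={"raw_signups": [1, 10], "raw_spend": [10, 50]},
    wanted=["spend_per_signup"],
)
print(result)  # {'spend_per_signup': [10.0, 5.0]}
```

Passing a different `run` callable (for example one that submits the call to a cluster) changes where the work happens while the three feature functions stay exactly as written, which is the separation the driver code is meant to provide.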

Justin Grammens 25:00

Yeah, that's awesome. So how long has Hamilton been around? When did you start the project?

Stefan Krawczyk 25:05

It went live at Stitch Fix in, you know, November 2019. So it had been running at Stitch Fix for quite a while before we open sourced it. We open sourced it, I guess, in October 2021. And yeah, we have a community on Slack. I've had some interesting conversations with people just trying to get Hamilton up and running. So for instance, I was talking to a consultant the other day, and he was like, Hamilton sounds like a pretty great tool for me, because it ensures that I leave code that is documented and unit testable for my clients. Yeah, I'm bullish, obviously, you know, slightly biased. Yeah, watch this space.

Justin Grammens 25:41

Yeah, for sure. For sure. It's exciting. Do you think you're going to continue to work in this same arena going forward, personally?

Stefan Krawczyk 25:49

I mean, I want to say, at Stitch Fix we were, I think, slightly ahead of our time in terms of trying to enable the person who's doing the data work to take things all the way to production. And I think that's a bit of the zeitgeist of the times as well: how can you enable the person who's doing the modeling to also take it to production? Because I think if you allow that, you increase iteration speed, because there are fewer people to talk to. Now the downsides are, you know, they might not be software engineers, and in some domains that could be a little tenuous. But I think with the right tooling and infrastructure abstractions, you can get pretty close to enabling someone who isn't necessarily a super awesome engineer to have the tooling to get stuff to production without worrying about it. And that then allows them the time to really focus on what's driving value for the business, which is probably building and creating better models. Yeah, for sure. So yes, I think I'll stay in the space, because I think, yeah,

Justin Grammens 26:45

you're passionate about it, for sure. Keeping data scientists productive, focusing on what they do best. Thank you for the work that you do. I mean, working on a lot of these frameworks is, in general, sort of a thankless profession in some ways. People spend a lot of hours, a lot of time, focusing on building tooling and tool sets, and then other people build on it. I guess that's open source, it's standing on the shoulders of giants. But anybody that spends their own free time building things for the open source community has a ton of respect in my book. So thank you very much for doing that. As people are coming out of school these days, for example, how do you suggest they get into this field? This is a question I love to ask people that have been in this data science, machine learning space.

Stefan Krawczyk 27:25

I mean, you should understand the tools that you're using. So the nice thing with open source is that you can go and look at the code; you can learn a lot by just reading code. And so one of the things that I learned, I guess, in my career, I mean, if you happen to go work at Google, or one of the bigger companies, reading their code bases is actually something that will teach you a few things, because I think there are problems and patterns that crop up perennially. Right? And so, what are the patterns and approaches that you can use to solve these problems, like how to deploy something to production? I think there's a lot of open source tooling, in which case you can learn that if you can read code. I guess my general methodology is, you know, read code, draw a diagram of it, so you can understand the full picture, and then that can give you at least some sort of mental model of why things work the way they do. That, I think, gives you a better understanding of the tooling, or at least maybe more insight as to how you can use it better. And then with open source, obviously, the other side is that you can then potentially contribute back or create issues like, hey, would this feature fit with this thing? And I think tool maintainers love feedback around these types of things. But yeah, for new college grads, I guess the downside right now is that there's a lot of tooling that abstracts away a lot of the things that you used to have to build yourself. And so that's why I'm like, yeah, you should read the code base, so you can at least know, and maybe then you also understand a little bit of the history and know where things came from.
But yeah, otherwise, you can move so much faster these days. So the other way is just, you know, use the tooling to build as quickly as possible, so you can iterate and develop that experience and even your intuition for what's a good idea and how long it takes. Because I think the other thing that you learn through industry and experience is sizing things: knowing what the hard problems are, knowing how long things will take, and knowing what methods you should be looking at to start a project. I think the only way to get better at that is just to iterate and try to do more new projects each time. This is true, I think, for engineering and true for data science. So, what features to build, how do you explore data, what should you do first to really understand the problem before you go on, right? And then you can relate that back to whatever business you're working at, because if you know the sizing and what are good ways to start or solve problems, you can then identify the business value or potential business impact, and you can then also help target where you should focus your work or time.

Justin Grammens 29:51

It just makes you more valuable, you know, as a person working with the company or for the company, whatever it is, and that's good. One thing that I tell engineers is, you know, people often don't understand why they're building what they're building. The more that the engineers know, the better they can actually do their job. It's not just "take this and make it do this a certain way." It's "we want to build this solution here, what do you think is the best way?" So the more an engineer can raise their hand and say, well, I'm going to attack it this way, I'm going to try it this way. And in some ways, as you were talking about it, I was thinking, you know, wow, just try and break the system. Look at it and say, well, what happens if I throw this variable in? Or what happens if I do this? You'll learn a ton just by actually getting compilation errors, or getting data spit out a different way. It's like, that didn't react the way I thought it would, so why don't I dig in further to understand it? Yeah. Well, how do people reach out and connect with you?

Stefan Krawczyk 30:41

So, yeah, I'm on LinkedIn, or I have a Twitter handle; feel free to ping me on Twitter or LinkedIn. Yeah, I don't have a blog or anything, actually. I mean, I've been trying to publish a little bit on Towards Data Science, so you can also follow me on Medium as well. But yeah, happy to take, I guess, questions, or if someone wants to message me and ask me about, you know, something that I did, or a career choice, happy to try to help if I have the time, for sure.

Justin Grammens 31:06

For sure. And I'll just do a quick plug here. You'll be speaking at our Applied AI meetup on November 3, so I'll be sure to put a link to that in the liner notes as well, and of course links off to your LinkedIn and all the other stuff. You know, are there any other topics or projects that you find interesting? Are there things that maybe you wanted to talk about that we didn't cover today?

Stefan Krawczyk 31:24

One of the projects I've spoken about before was this project called the model envelope. It was a core abstraction that essentially enabled us to build a little framework where you could save a model, and then a data scientist didn't have to write the code to create a web service that, you know, spouts predictions from the model, or write a batch task to run the model in batch. And the metaphor we were going for is that it's an envelope, because it holds not only the model, but you can stuff other things in with it. And so essentially it was self-describing: it contained all the data we needed to describe a model such that, given a context, we could generate the code to run that model in that specific context. So usually, with Python models and things, you need to get your Python dependencies correct. Similarly, if you want to create a type-safe API, so you want to check that the values are ints, or strings, or floats, you had to manually do that. In which case, we built a system where, given a model, we figure out what its API inputs and outputs are, obviously with some help from what you pass to us in the save API call. And then essentially we treat it as a black box, like a UDF, so a user-defined function, but with state, because it's a model. And so from there, yeah, a data scientist could save a model and then in under an hour have their model deployed. That's the crux of the abstraction: we saved everything we needed to know about a model, and then getting it to production was just a few clicks in a UI and some configuration, like the name of the service you want deployed. And then, yeah, data scientists could easily get things to production pretty quickly.
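As a rough illustration of the envelope idea, the sketch below bundles a model with the metadata needed to call it safely. It is an assumption-laden toy, not the Stitch Fix system: `ModelEnvelope`, `save_model`, `predict_size`, and the field names are all hypothetical, and the input schema is inferred from a single example input passed at save time.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ModelEnvelope:
    """Hypothetical container: the model plus what is needed to run it."""
    name: str
    predict_fn: Callable[..., Any]
    input_schema: Dict[str, type]
    dependencies: Dict[str, str] = field(default_factory=dict)
    tags: Dict[str, str] = field(default_factory=dict)

    def predict(self, **inputs):
        # Type-check inputs against the stored schema before calling the model.
        for key, expected in self.input_schema.items():
            if key not in inputs:
                raise TypeError(f"missing input: {key}")
            if not isinstance(inputs[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        return self.predict_fn(**inputs)


def save_model(name, predict_fn, example_input, dependencies=None, tags=None):
    """Infer the API schema from an example so the caller need not write it."""
    schema = {key: type(value) for key, value in example_input.items()}
    return ModelEnvelope(name, predict_fn, schema, dependencies or {}, tags or {})


def predict_size(height_cm, weight_kg):
    # Stand-in for a trained model; treated as a black box by the envelope.
    return "M" if height_cm * weight_kg > 10000 else "S"


envelope = save_model(
    "size_model",
    predict_size,
    example_input={"height_cm": 170.0, "weight_kg": 65.0},
    tags={"team": "styling"},
)
```

Because the envelope carries its own schema, dependencies, and tags, a platform could in principle generate a web service or batch job around `envelope.predict` without the data scientist writing that wrapper by hand.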

Justin Grammens 33:01

That's phenomenal. Yeah. I guess I saw that you presented at MLconf Online 2021, sort of talking about this.

Stefan Krawczyk 33:09

Yeah, yeah. So I spoke at MLconf, and I also did a session on this topic as a guest lecturer at Stanford's machine learning systems design course. But essentially, yeah, it was an abstraction to enable "deployment for free," as we called it. But the flip side, the other part of it, is that from an MLOps perspective it actually enabled us to control and do a lot which, previously, you wouldn't necessarily have the ability to do as a platform team. The abstraction was that, given you had just trained a model, you would save it; there wouldn't be a deploy command at the end of your model training process. Instead, we purposely abstracted it so that you had to create a rule set. You curated your models in the model envelope system with tags, and then, given some tags and the properties of the model, say we only want to deploy the model if it was trained off of the main branch of a particular Git repo and it has these tags, we would automatically, say, deploy a web service with that model. Because we controlled the deployment process end to end, it was very easy for us to change the underlying architecture of what happens, and data scientists didn't have to do anything to benefit from it. If we wanted to ensure that all machine learning models had, you know, Datadog integration with a specific setup, it was very easy for us to do so, and data scientists just got it for free whenever we upgraded and touched the system.
So from a platform perspective, it was a great abstraction for us to really be able to, as we got better at things, be the tide that raised all boats. Before, when everyone was deploying their own models and doing their own bespoke things, it was very uneven: some models had very good observability and coverage, or even deployment systems, and others didn't. Whereas with the model envelope, we standardized a lot of it and made it a lot more amenable for platform teams to manage without pain, and without the data scientists actually having to migrate any code whenever we changed things. Yeah,
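The rule-set idea can be sketched as a small matching function. This is only an illustration under stated assumptions: the dictionary layout, the `matches` and `plan_deployments` helpers, and the example repo, tag, and action names are hypothetical, not the actual Stitch Fix rule format.

```python
def matches(rule, model):
    """True when the model's properties and tags satisfy everything the rule requires."""
    properties_ok = all(
        model.get(key) == value for key, value in rule.get("properties", {}).items()
    )
    tags_ok = all(
        model["tags"].get(key) == value for key, value in rule.get("tags", {}).items()
    )
    return properties_ok and tags_ok


def plan_deployments(rules, models):
    """Return (action, model name) pairs; nothing is deployed by the training code itself."""
    return [
        (rule["action"], model["name"])
        for model in models
        for rule in rules
        if matches(rule, model)
    ]


rules = [
    {
        "action": "deploy_web_service",
        "properties": {"git_repo": "example/recs", "git_branch": "main"},
        "tags": {"stage": "candidate"},
    }
]

models = [
    {"name": "recs-v7", "git_repo": "example/recs", "git_branch": "main",
     "tags": {"stage": "candidate"}},
    {"name": "recs-experiment", "git_repo": "example/recs", "git_branch": "feature-x",
     "tags": {"stage": "candidate"}},
]

plan = plan_deployments(rules, models)
print(plan)  # [('deploy_web_service', 'recs-v7')]
```

Keeping the decision in rules like these, rather than in a deploy call at the end of each training script, is what lets a platform team change how deployment works in one place.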

Justin Grammens 35:12

No, that's awesome. I'm looking at it here, and it says no code needs to be written by a data scientist to deploy any Python model to production. So super useful there, in that particular case.

Stefan Krawczyk 35:23

It's similar in spirit to MLflow or ModelDB, which were the entrants in the space. We built it because, at the time we started building it, there was nothing open source that fit. Also, most of the open source systems at the time were very much focused on making an individual productive, but we had, you know, over 100 data scientists. In which case, building a system that would scale to enable a lot of models, and being able to arbitrarily organize them, was one of the key requirements for us. So we went with a tagging model, which then allows you to arbitrarily create hierarchies based on the tags that you use as a way to group and manage models. That also enabled us to build our own CI/CD, so continuous integration, continuous deployment, process. So if you wanted to deploy your model to a staging environment, run some checks, and then deploy it to production, it was very easy to set that up with the model envelope deployment system.

Justin Grammens 36:16

I love it. That's awesome. That's awesome. Well, on that note, Stefan, I'm really excited to have had you on the program. This was a great conversation, and I'm really excited to get this published and put out into the ether, so all of our Applied AI listeners here in the community can get a chance to download Hamilton and learn from it. And also, you know, we'll be having you present to our group as well in a couple of months, so that'll be awesome. So thank you so much for your time today. Thank you for sharing your insights, and for all the hard work that you've been doing in this data science community to start putting together a lot of these tools that I think a lot of data scientists need. It's a lot of the infrastructure and the plumbing that I think is missing to really, I guess, provide the full context of what artificial intelligence can do today. Without all of the plumbing and a lot of the infrastructure, it's not going to happen. So thank you again. Appreciate your time.

Stefan Krawczyk 37:09

Thanks for having me, Justin.

AI Announcer 37:11

You've listened to another episode of the Conversations on Applied AI podcast. We hope you are eager to learn more about applying artificial intelligence and deep learning within your organization. You can visit us at AppliedAI.MN to keep up to date on our events and connect with our amazing community. Please don't hesitate to reach out to Justin at AppliedAI.MN if you are interested in participating in a future episode. Thank you for listening.

Justin Grammens

Host