[00:00:00]

This episode of Founders' Field Guide is sponsored by Klaviyo. Want to deliver marketing moments that last a lifetime? Klaviyo is the ultimate marketing platform for e-commerce. With targeted segmentation, email automation, SMS marketing and more, Klaviyo helps you create your ideal customer experience.

[00:00:15]

See why more than 50,000 brands like Living Proof, Solo Stove and Nomad trust Klaviyo to grow their business. Keep your customers coming back. Get a free trial at Klaviyo dot com slash founders. That's K-L-A-V-I-Y-O dot com slash founders. Stay tuned at the end of the episode, where I talk to Klaviyo customer Nomad about their origin story and how they work with Klaviyo.

[00:00:37]

This episode is also brought to you by Vanta. Does your startup need a SOC 2 report to close big deals, or do you already have a SOC 2 report and want to make it easier to maintain? Vanta has built software that makes it easier to both get and renew your SOC 2. With Vanta's continuous monitoring solution, you avoid hosting auditors on site and taking hundreds of screenshots to prove that you are compliant, so you can focus on building your business. Vanta partners with audit firms who file your SOC 2 report directly inside Vanta at a fraction of the normal cost.

[00:01:05]

Hundreds of companies, including more than one hundred Y Combinator businesses, are leveraging Vanta today to streamline compliance and focus on building their businesses. Founders' Field Guide listeners can redeem a one thousand dollar off coupon

[00:01:17]

at Vanta dot com forward slash Patrick. That's Vanta dot com forward slash Patrick.

[00:01:23]

Hello and welcome everyone. I'm Patrick O'Shaughnessy and this is Founders' Field Guide. Founders' Field Guide is a series of conversations with founders, CEOs and operators building great businesses. I believe we are all builders in our own way and this series is dedicated to stories and lessons from builders of all types. You can find more episodes at Investor Field Guide dot com.

[00:01:46]

Patrick O'Shaughnessy is the CEO of O'Shaughnessy Asset Management. All opinions expressed by Patrick and podcast guests are solely their own opinions and do not reflect the opinion of O'Shaughnessy Asset Management. This podcast is for informational purposes only and should not be relied upon as a basis for investment decisions. Clients of O'Shaughnessy Asset Management may maintain positions in the securities discussed in this podcast.

[00:02:11]

My guest today is Ali Ghodsi, founder and CEO of Databricks, a data analytics platform for data scientists and developers. He's also the founder of Apache Spark, the open source project that Databricks is built on, and is an accomplished researcher at UC Berkeley's Computer Science Department. Our conversation ranges from the origins of distributed computing to modern data infrastructure, how companies can leverage their massive data sets, and the transformation of Databricks through its phases of growth as a business.

[00:02:38]

While technical, it's exactly the kind of conversation I like to have on this show. I hope you enjoy my great conversation with Ali Ghodsi.

So, Ali, I'd love to start our conversation at the end, with what Databricks is today, to level set for the audience exactly what you do, what your focus is and what the business does for customers.

[00:02:57]

Could you just walk us through, as we sit here at the end of 2020, what the company looks like and the service or problem it solves for customers?

We're a seven-year-old company. We have about seventeen hundred employees, and we help enterprises take massive amounts of data and do machine learning, AI and data science on that data. Most enterprises, they've seen how Silicon Valley tech companies have used data in a really strategic way to disrupt industries. They want to do the same thing, but they don't have thousands of engineers that can help them build a data platform custom for their use case.

[00:03:31]

We've built that and we enable them to do that.

[00:03:34]

I would love to go all the way back and sort of tell the history of distributed computing because everybody will have heard the term big data. This was a really popular term, I don't know, five or seven years ago. And I think that concept, that term, the fact that it was being talked about in normal business circles was the result of progress in the world of distributed compute and storage. I'd love you to rewind, however far back you think is appropriate to go.

[00:03:56]

Maybe it's back to the 2006 Yahoo days. Tell the modern history of distributed computing, what it means and why it's so interesting and important.

[00:04:06]

I think what happened is that around the 2000s we hit this wall. We call it Moore's wall, because they couldn't figure out how to make computers faster. So everything started moving into these data centers. The new computer was the data center, and in these data centers, where you had hundreds or thousands of machines, people started collecting more and more data. And the reasons for this were multiple. One was the price of storage kept going down. So it became cheaper and cheaper to store all these massive amounts of data.

[00:04:33]

And no one wanted to throw it away. And they had heard that there were some tech companies like Google that had gotten a lot of value out of the data. So they wanted to do the same thing. Secondly, more and more people were connected on the Internet. There were sites that had millions of users coming and visiting them. There was this aspiration that if we collect all this data, maybe we can do great things with it.

[00:04:55]

I think around 2005, 2006 people still didn't know. I mean, the tech companies knew. They typically had an ad business. They were collecting the data. They were optimizing how to show ads to people. The rest of the enterprises didn't really know what to do with it. So this big data revolution sort of entered its first phase, which is: let's collect all this data for the amazing things we can do with it once we get there.

[00:05:14]

It's cheap. Why not do it? So that's kind of what started around that time. And people started collecting these things into massive, massive data sets. And back then, the measurement of success was how much data do you have? We're very successful: we have one petabyte of data. We're even more successful: our data has grown from one petabyte to five. This is amazing. So that was kind of the first generation.

[00:05:35]

What exactly was going on? Almost down to the hardware that was revolutionary back in the early to mid 2000s that made some of this possible. So people are used to having their own data on their own computer. Everyone's familiar with the term cloud. But I think early on everyone would think of cloud as someone else's computer, a computer somewhere else, not under my desk somewhere else. What literally was happening from the hardware and software standpoint in the early to mid two thousands to make some of this possible?

[00:06:04]

Yeah, I mean, in the early 2000s, you would buy a big supercomputer and solve a lot of problems. In fact, I remember at Berkeley, we visited Twitter in the very early days when they had one giant machine that processed all the tweets. I don't remember how much memory, but it was some gigantic amount. It was, wow, how did you get a machine with that much memory? And that was the way to solve problems. You would get a very expensive supercomputer.

[00:06:24]

It would do all your computations for you. But as we hit this Moore's wall and couldn't scale these computers, around 2005 the CPU speed stagnated. Computers were not getting any faster. Most computers are basically three gigahertz systems. So that kind of started meaning that we probably have to distribute. Our needs are not going down. The amount of data we have is not going down. The amount of processing capacity we need is just increasing exponentially.

[00:06:48]

We probably need more machines. So that's when things started distributing out into data centers. It soon became infeasible for every organization to manage thousands and thousands of machines on their own, on-prem. So the cloud revolution kind of took off, which said, hey, this is a utility. Why don't we manage all these thousands of machines for you in the cloud? You can just use them. This also accelerated things, because now anyone could pretty much, with a credit card, start renting thousands of machines, this new computer in the cloud, and start storing things on it very cheaply and get started.

[00:07:21]

What was the harder challenge back then between the hardware and the software needed to coordinate putting stuff into those systems and then accessing and pulling it out in a timely fashion? What was the main innovation that happened to make that possible?

[00:07:33]

Well, actually, a few things changed. In the very early days, the problem was there was so much data and the networks were not fast enough. So whether it was in the cloud or whether it was your own data center, the big innovation of MapReduce, if you go back to that time, was: we have to move the computations close to the data. You specify what you want to do with your data, but then you had this thing called MapReduce, which was really clever about, let's move the code that runs on this data close to the data, because there's so much data we can't transfer it over the network.

[00:08:03]

If we try to transfer it over the network, the whole network will collapse. That was the name of the game back then. And there was a lot of research on how do we avoid that. Turns out you don't actually need to do that anymore.

Can you say a little bit more about that?

[00:08:15]

So I want to make sure I understand. So let's make this stupid simple. I've got a terabyte of data sitting somewhere on actual hardware and, whatever it is, it's customer data or something. If I want to do some sort of compute on top of that data, run a query or produce some output, the network was the limiting factor. So just getting the data from there to somewhere else to do the compute was hard. So some innovation was, just do the compute in the same data center or something.

[00:08:39]

Am I getting that right?

[00:08:40]

Yeah. Let's take an example. Let's say you have a lot of data on how people have clicked on a website. Since you can't buy a supercomputer anymore, you've probably distributed that over many machines. So you have a hundred machines, they have hard drives, and you stored on the hard drives of those machines the data, the petabyte of how people have clicked on your website. But now you want to compute something. Simple example: we just want to see what's the average number of clicks on a particular link.

[00:09:04]

Twenty years ago, you would just write the code that takes all those numbers and averages them, and then go over the whole data set.

[00:09:12]

But here, to do that you would have to move all the data from these hundreds of machines to the machine that's doing the computation, and that would just crash. And in fact, we did that at UC Berkeley. When we were developing the first version of Spark, we crashed parts of the network. They called us up and said, what are you doing? Well, we're transferring all this data from all these machines to every other machine. So that was the name of the game in the early days.

[00:09:33]

How do we avoid that? How do we actually just move pieces of the computation that computes the average just to all these machines, let them compute it locally and then send some kind of aggregate?
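As a rough illustration of the idea described here, the sketch below computes an average by sending a small function to each partition and shipping back only a (sum, count) pair instead of the raw data. It is a single-process toy, not Spark or Hadoop code; the names `local_aggregate`, `combine` and the sample `partitions` are hypothetical and chosen only for this example.

```python
from typing import Iterable, List, Tuple


def local_aggregate(clicks: Iterable[int]) -> Tuple[int, int]:
    """Runs where the data lives: reduce one partition to (sum, count)."""
    total = 0
    count = 0
    for value in clicks:
        total += value
        count += 1
    return total, count


def combine(aggregates: Iterable[Tuple[int, int]]) -> float:
    """Runs on the coordinating machine: merge the small per-machine results."""
    total = 0
    count = 0
    for partial_sum, partial_count in aggregates:
        total += partial_sum
        count += partial_count
    if count == 0:
        raise ValueError("cannot average an empty data set")
    return total / count


# Hypothetical click counts, one inner list per machine.
partitions: List[List[int]] = [[3, 5, 4], [10], [2, 6]]

# Only these tiny tuples would cross the network, never the raw click data.
aggregates = [local_aggregate(p) for p in partitions]
average = combine(aggregates)
print(average)
```

Sending sums and counts rather than per-machine averages matters: averaging the averages would give the wrong answer when partitions have different sizes.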

[00:09:43]

So you mentioned that originally that was a unique solve, but then today we no longer have to do it that way. So what's changed?

[00:09:49]

Networking technology, unlike CPUs, has just gotten faster and faster and faster. And they've also come up with techniques. This is actually research that started at UCSD. They figured out how to configure the networks in these data centers in a way such that any two machines can communicate with each other at full speed. You can no longer collapse the network with these heavy computations. And in some sense these days, the network has been completely marginalized and it's no longer in your way.

[00:10:20]

So we no longer need to actually move the code close to the data. And that happened around 2009, 2010, 2011. And today, virtually every data center, every public cloud provides really, really fast networks that just won't get in your way.

[00:10:35]

Can you describe the difference between doing something from storage, like you first described, and doing it in memory? I think that's an important difference to describe for the audience. And then I want to talk about Hadoop, its origins, and how that bleeds into the origins of Spark.

[00:10:50]

Back then, memory was expensive. Again, that's something that's changed a lot. Memory's gotten way cheaper, storage got way cheaper. The CPU is the only kind of thing that has stagnated in the last couple of decades. So with these big datasets, you would have to do the processing mostly on disk. So the data sits on disk, you load it in, you do some computation, you write it back to disk, because you don't have enough memory to store all of it in memory.

[00:11:15]

I think to some people memory and hard disk might sound like the same thing. Could you just describe literally what the difference between those two things means?

[00:11:22]

On the hard drive, you have persistent storage. If you turn off the computer, the data remains there. Memory, when you reboot, it goes away. But the difference is it's much, much, much faster. So it's orders of magnitude faster to access data that's in memory, and you typically have much less memory than you have disk.

Perfect.

[00:11:39]

So can you describe the origins of Hadoop and why? If you're in the data world, everyone knows that this is a key milestone in the timeline of doing work with big data sets. Could you describe what Hadoop did and why that was so interesting and important?

[00:11:52]

Yeah, Hadoop enabled you to process larger data sets than ever before, in parallel, on hundreds or thousands of machines. Think of it: early 2000s, this new machine or computer arrives, which is the data center with hundreds or thousands of machines. And there's a kind of operating system for it that was developed. The first operating system was Hadoop and MapReduce, which was a way in which you could basically process any amount of data you wanted to process. Unfortunately, though, it was a little bit complicated.

[00:12:20]

Just like early computers and early operating systems, all the computations would have to be described in terms of just two functions, one called map and one called reduce, which made it very cumbersome and complicated to write programs for this. But it was amazing. Now you could crawl the whole World Wide Web, process it and do computations on it, if you just were sophisticated enough.
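To make the "just two functions" constraint concrete, here is a minimal single-process sketch of the map/reduce programming model, counting clicks per link. It is not the Hadoop API; `map_fn`, `reduce_fn`, `run_mapreduce` and the sample `click_log` are hypothetical names used only for illustration, and a real system would run the map and reduce calls in parallel across many machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(record: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one input record (a clicked link) into (key, value) pairs."""
    yield (record, 1)


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values that share a key into one result."""
    return (key, sum(values))


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Toy driver: map every record, group values by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle" step, done in memory here
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical click log: one entry per click.
click_log = ["/home", "/pricing", "/home", "/docs", "/home"]
counts = run_mapreduce(click_log, map_fn, reduce_fn)
print(counts)
```

Even this trivial count has to be split into a mapper, a grouping step and a reducer, which gives a sense of why more involved programs were cumbersome to express this way.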

[00:12:42]

What were some examples of interesting use cases of

[00:12:45]

this new distributed computing technology that were happening at around 2010 or something like that, that wouldn't have been possible in, say, 2000?

The first use cases of this were largely, it started with crawling the web and building indices for all the stuff that you crawled. Other large use cases were click logs, logs of how people are clicking on these websites. You have billions of users on the website and they're clicking around. How do you collect all of that to make sense of it so that you can actually start doing more advanced things with it?

[00:13:15]

For instance, maybe you can figure out what ads to show someone or what's more interesting to them. So you could now, for the first time, do these really, really large scale computations, which you couldn't do before. It was a major breakthrough, but it was very, very hard to write those programs. It was not for normal programmers.

[00:13:30]

So now, having laid a lot of great groundwork, hopefully people will have some context. We can get to the origins of Spark itself, the open source project around which Databricks is built. What was the very beginning of Spark? Who designed it? What was the intention? How did it get going?

[00:13:45]

The real story is that there was actually a group in the lab. So we were sitting together at UC Berkeley, and the people that built these computer systems, which was our group, and the people who were just doing pure machine learning, who were more of a sort of math background, were sitting right next to each other, because the idea at Berkeley is all these people should work together and come up with really great things. And they were trying to do recommendation. So in particular, they were participating in the Netflix competition.

[00:14:07]

And the Netflix competition was a competition in which Netflix showed you how people had rated movies, and you had to come up with a machine learning predictive model which would recommend movies to other people based on their preferences and how movies have been rated in the past. And whoever got the most number of people clicking on their recommendation would win the contest and win, I think, half a million dollars, a million dollars. The machine learning team was saying that this was really, really hard to do for them.

[00:14:32]

And we started talking more closely with them: what's going on? And they said, well, we're using this Hadoop thing and it's just so slow. And we're doing many iterations over the data, because these machine learning algorithms are highly iterative in nature. You go over the data again and again and you keep refining it until you get good enough, and every time we have to load all the stuff from disk, put it in memory, do some iteration on it, write it back to disk. It's just too slow, it takes too long.

[00:14:56]

Can you guys help us? That's what it started with. And we said, yeah, memory is getting pretty cheap. Why don't we just figure out a way where we can just load all of this data into memory, and then we can just do the iterations super fast in memory, because that's faster than disk. That was the origin of Spark. So in some sense, it was a tool created to be able to do really fast machine learning to win the Netflix competition.
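The difference between the two approaches described here can be sketched in a few lines. The example below is a toy under stated assumptions, not Spark code: it fits a single number to a small list of ratings with gradient descent, once by re-reading the file on every iteration and once by loading it into memory a single time. The function names, the JSON file and the sample ratings are all hypothetical.

```python
import json
import os
import tempfile
from typing import List, Tuple


def load_from_disk(path: str) -> List[float]:
    """Stand-in for an expensive read of the full data set from disk."""
    with open(path) as f:
        return json.load(f)


def gradient_step(estimate: float, data: List[float], lr: float = 0.5) -> float:
    """One gradient-descent step toward the value minimizing squared error."""
    gradient = sum(estimate - x for x in data) / len(data)
    return estimate - lr * gradient


def fit_reloading(path: str, iterations: int) -> Tuple[float, int]:
    """Disk-based style: every iteration reloads the data."""
    estimate, reads = 0.0, 0
    for _ in range(iterations):
        data = load_from_disk(path)
        reads += 1
        estimate = gradient_step(estimate, data)
    return estimate, reads


def fit_cached(path: str, iterations: int) -> Tuple[float, int]:
    """In-memory style: load once, then iterate over the cached copy."""
    data = load_from_disk(path)
    reads = 1
    estimate = 0.0
    for _ in range(iterations):
        estimate = gradient_step(estimate, data)
    return estimate, reads


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ratings.json")
    with open(path, "w") as f:
        json.dump([4.0, 2.0, 5.0, 3.0, 1.0], f)  # hypothetical movie ratings
    slow_estimate, slow_reads = fit_reloading(path, iterations=20)
    fast_estimate, fast_reads = fit_cached(path, iterations=20)

print(slow_reads, fast_reads)
```

Both versions reach the same estimate; the only difference is how many times the data is read, which is the cost that dominates when the data set is large and the algorithm needs many passes.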

[00:15:18]

Fantastic.

[00:15:19]

And what were the first year or two of Spark like? What was happening in terms of the growth of its participants, its use cases? And how does that then translate into the creation of Databricks? You've got Spark as an Apache open source project that in many ways anyone can access, and then Databricks as a corporate sponsor or entity that sits next to it, on top of it, around it. I don't know how you would describe it, but describe those early days, and same question around Databricks.

[00:15:45]

Why did Databricks come to be, in the same way that Spark came to be?

[00:15:49]

The popularity of Spark took a very long time to take off. So it started in 2009.

[00:15:55]

People were excited then. As an academic project, it had a lot of impact and a lot of excitement around it. But if you went out to industry, no one knew about it. It took many years: 2009, 10, 11, 12. Not much was happening. And we were really trying to get the industry to adopt it. We went to these companies that at the time had this Hadoop technology and we told them, please take this.

[00:16:16]

This is one hundred times faster. It's much, much, much easier to use. And it supports machine learning. It supports real time computations. And they just looked at it and said, nah, this is just an academic project. What if the students quit to start their own company or something? Then we'd be left here with the software. They just ignored us. And in 2012, we kind of had it and were like, if we want people to adopt this, we probably have to start a company around it, because it's not going to work.

[00:16:40]

Sitting here at UC Berkeley, they're not going to pick it up. We just don't seem to be getting the attention that this project deserves.

[00:16:47]

What was the kernel, the early customer use case that you latched on to? I'm always interested, with new businesses like this, in what the first commercial engagements are with people that need the service. What was the origin story there?

[00:16:59]

Yeah, I mean, there were lots and lots of use cases. Industry did want to do machine learning, basically AI use cases. I think the very first use case at Databricks was a company that was trying to understand how video streams were being actually downloaded on the Internet and how they could increase the quality of this. Obviously, if you do that on the planet, we're talking massive amounts of data, massive amounts of viewers, and the quality deteriorates and improves over time. Being able to, in real time, make quick decisions on that.

[00:17:28]

Maybe we should follow the traffic over there instead, and being able to do that with AI predictive technologies. That was the first use case that our first customer had at Databricks.

[00:17:37]

Can you describe how a business like yours relates to the open source project? How do you think about how those two things work together, interrelate, hand off to one another? It's a very unique and increasingly common way that a lot of developer-facing technology projects and businesses are built, to have an open source component. How do you think about the benefit of that piece of the business?

[00:18:01]

We are a business, so we are a cloud company, so we manage and run software for you. These open source projects, they're really on-prem products. You download them. They have a particular version. They're not cloud software. Cloud software like Google Search doesn't have a version. When you go and search on Google, I don't ask you which version you searched on. It's just there. It just works. They keep upgrading it. That's cloud software.

[00:18:24]

We do the same thing. We run Spark and lots of other things these days, and Spark is a small portion of what we do today, in the cloud on behalf of our customers, and we just automate that away. But that's not what we were doing the first three years. The first two, three years, we just wanted Spark to take off, because it seemed just impossible, because it was all the rage about Hadoop and people just kept talking about Hadoop and how it's awesome.

[00:18:45]

I remember Cloudera had just gotten a seven hundred million dollar investment from Intel, so it was one of the largest investments that year, the second after Uber's investment. So we were just trying to get this to take off and it seemed it was impossible. People said Spark doesn't work if you don't have enough memory, which was not true. Or Spark is just good for some machine learning use cases, not others. Or Spark only works for real-time computations, but doesn't work on the others.

[00:19:09]

So it just seemed impossible to educate the market that none of this is true. So that was our struggle the first three years.

[00:19:16]

As you think about it today, how much of Databricks the business is, in your mind, related to the open source world? I'd love to hear what the lineup is outside of Spark, the other services that you manage in the cloud on behalf of customers, to sort of lay out the different methods or products or services. But just as you think about it today, how key is open source, if at all, and do you think that will change in the future?

[00:19:42]

I think open source is critical to enterprises. I think enterprises don't want to lock themselves in to proprietary software. They've gotten burned since the eighties. They've bet on vendors that are really, really good, who have great innovations, and they give them all their data and they lock it into those proprietary formats. And then those vendors, because of that lock-in or that moat, they become complacent. After a while, they don't need to innovate anymore.

[00:20:07]

Eventually, the founders or the original folks move on. And at that point it's just bloated old software, which is very costly, and they can just keep increasing the price. And you have to pay more for it, because it's so hard for you to move off of it, because you're locked in. So I think enterprises, if they have the choice, they prefer open source because it avoids that lock-in. So it's critical to what we do. So pretty much every element of the Databricks platform where there would be a lock-in, we've opened it up as an open source project.

[00:20:34]

So today there's Spark to access all your data sets and get all your data. There is Delta, which is the key project that makes your data really high quality and really performant for downstream use cases. There is a project called MLflow, which is really how you operationalize end-to-end machine learning. Finally, there's a project called Redash, which is how you deal with all your visualizations and dashboards and things of that nature. So that's also, at its core, open source.

[00:21:00]

We don't want anyone to get locked into us. We want them to pick us because we're providing so much value in our software that's in the cloud.

So you've described sort of an interesting platform now, a set of tools that machine learning researchers could use to do their job. Let's boil this down to like an individual research team or something. They come to you and, let's make it as simple as possible, they've got some big data set. They want to use that data set to build a prediction.

[00:21:24]

And that prediction is then used somehow in their business or for some use case. So something really straightforward. Something like Netflix probably is a good example for people to think about this. Lots of data. They want to use that data to build a predictive model to do something interesting. What do they need to show up with of their own? What do they need to bring to the table to then start working with Databricks? Is it just an initial data set?

[00:21:45]

How do you think about what a team or a person or researcher shows up with that then makes Databricks and its platform powerful?

[00:21:52]

I think they need data sets, so they need actual data, which luckily almost every enterprise on the planet has been collecting since the 2000s. And they need to have a clear understanding of what it is that they really want to deploy or actually realize. So that requires that they understand their business. What's the most important project to have impact on the business? What's the biggest business value it can provide? Is it a prediction for something? For a company like Shell, it's being able to predict its equipment breaking down in advance.

[00:22:21]

If they can do that, then they can replace those parts in advance. That saves them hundreds of millions of dollars. And it's actually better for nature. It's better for employees and the environment. For a company like Comcast, it's a use case where they have a remote control and it has a voice button. You press that and you can speak into it and say, hey, what's the weather today? And then that actualizes.

[00:22:44]

For a company like Regeneron, the use case is finding that particular gene that is responsible for a disease. For instance, they found the gene that's responsible for chronic liver disease and vertebrates and then they can do drug testing and develop drugs to cure those. So it depends on so understanding that business. You shouldn't be on drugs if they're just showing up saying, hey, we have a bunch of data and we want to do some cool stuff, then that's usually when we say, well, what is it really you want to do?

[00:23:08]

What are the use cases that really are pertinent for your business?

[00:23:12]

How does the business itself work? One of these teams shows up. Comcast shows up with a team of researchers. Are they self-serve, just starting to use the platform without really interacting with somebody at Databricks? Is it higher touch? Is there a service model? I'm always interested by how these engagements work, and I'm sure it's different based on size. But what are the ways in which Databricks engages with its customers?

[00:23:32]

I think it's different from many other sorts of models because of this open source nature and because there is a completely free, open version of this called Community Edition. There are hundreds of thousands of data scientists that come and use that every day, and they can just swipe their credit card, not talk to us. And there is a free version that they can use, Community Edition. What then happens is that typically our sales teams start engaging with those that have a lot of usage on the platform.

[00:23:55]

And that's when we start getting more strategic. A really strategic project that can help a company save a hundred million dollars, that's typically not a project where some engineer swipes a credit card and then soon they are saving one hundred million dollars for their business and paying Databricks millions of dollars for that. That usually requires investment from the leadership of that company, to go in and say, we want to do this and we've got to put resources around it. So that's when we get engaged on the sales side.

[00:24:22]

And depending on if they need help, we have actual professional services to help them with it. If they need augmented services, there are SIs that we work with that can come in and help them with that. Then there is, of course, the platform, and I can explain a little bit how they actually use it.

Please do. I'd love to get right into the platform.

Typically they have the datasets. Getting access to that data typically happens with Spark.

[00:24:42]

Spark is the technology that enables getting it all loaded into a data lake where they can actually start processing it. The next step is to make sure that that data actually has high quality and a structure and that it's organized in a way so that you can access it really fast. For that, the biggest innovation, and the project that Databricks is spending most of its resources on, is the Delta project. The Delta project is also an open source project. That's where you organize your data such

[00:25:07]

that you don't just have a data swamp where you just dump everything into a lake. So now you have your data structured and it's really fast to access it. You might have even set it up in Delta so that it's in a streaming, real-time fashion, so the data is getting updated in real time. Now you'll start looking at, depending on your use case, if you want to do machine learning, you start actually exploring building machine learning models. So you start actually looking at the data and looking at using various machine learning models to do predictions.

[00:25:34]

It's a highly iterative task to come up with a machine learning model. Machine learning folks usually don't just sit there and then come up with a predictive model and they're done. They actually typically have to iterate on it hundreds of times. So they try a prediction. Maybe the accuracy is not that good. They go back to the data, they augment it with more data. Maybe they buy some datasets to augment that to see if there is more signal they can get there, and then try it out again.

[00:25:56]

So they iterate on this a lot. The open source project called MLflow helps them be productive when they do that. It tracks all the models they created. It helps them with governance and access control on it. And you can actually, with MLflow, do what's called serving, which is the final product where you're actually using it. You're using it in the web page or using it on that remote control, or calling in and saying, right now, give me that prediction: is this machine going to break down or not?

[00:26:23]

That's what the platform looks like end to end.
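Editor's illustration: the iterate, track, pick-the-best, then serve loop described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not MLflow's actual API: the `Tracker` class, the synthetic temperature readings, and the `will_break` function are all hypothetical stand-ins invented for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Run:
    """One tracked experiment: the parameters tried and the accuracy reached."""
    run_id: int
    params: dict
    accuracy: float


@dataclass
class Tracker:
    """Hypothetical in-memory stand-in for an experiment tracker (not MLflow)."""
    runs: list = field(default_factory=list)

    def log_run(self, params, accuracy):
        run = Run(len(self.runs), dict(params), accuracy)
        self.runs.append(run)
        return run

    def best_run(self):
        return max(self.runs, key=lambda r: r.accuracy)


def make_readings(n, seed=0):
    """Synthetic data (an assumption): a machine 'breaks' when temp exceeds 60."""
    rng = random.Random(seed)
    readings = []
    for _ in range(n):
        temp = rng.uniform(0.0, 100.0)
        readings.append((temp, temp > 60.0))
    return readings


def accuracy(threshold, readings):
    """Fraction of readings where 'temp > threshold' matches the true label."""
    correct = sum((temp > threshold) == broke for temp, broke in readings)
    return correct / len(readings)


# Iterate: try several candidate models and log every attempt.
tracker = Tracker()
readings = make_readings(500)
for threshold in range(10, 100, 10):
    tracker.log_run({"threshold": threshold}, accuracy(threshold, readings))

# Pick the best tracked model and expose it as the "served" prediction.
best = tracker.best_run()


def will_break(temp):
    """Serving step: answer 'will this machine break down or not?'"""
    return temp > best.params["threshold"]
```

In a real deployment the tracker would persist runs and control who can access them, and the serving step would sit behind a web page or an API call rather than a local function.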

[00:26:25]

So it's really infrastructure for machine learning researchers, and very general purpose. I love the three different examples you gave of Shell, Regeneron, Comcast. Everyone knows those businesses, but they are very different use cases or products or outcomes, all using the same sort of data driven research model somewhere in the middle to produce that product or service. With that in mind, how do you as a company think about making investments yourself that will earn good or high returns on capital into the future, especially because I imagine a lot of this is engineering, research, development, et cetera.

[00:27:00]

How do you think about as a capital allocator making investments internally?

[00:27:04]

Well, first, we think that this TAM is going to be absolutely gigantic. I gave you those three examples on purpose, completely polar opposite use cases. And I could just go on all day with use cases like that. You can find them in every industry. It's not just one or two interesting use cases per company. If you look at the original companies that started doing this, deploying these machine learning techniques, they have hundreds or thousands of use cases.

[00:27:28]

Internally, a company like Uber predicts the price. It predicts the ETA for you. It tells you when the food is going to be ready, puts more people on the same routes, carpooling. It's not just one data science or research team. You actually enable your organization to be data driven. Then you can actually compete in a different way and disrupt the industry that you're in. Based on that, we're looking at how can we actually enable the whole organization, how can we democratize data and A.I.? Any investments we can make in the platform that enable more people in the organization to use these techniques and be data driven,

[00:28:01]

That's, we think, strategic, because we think eventually, 10 years, 15 years from now, every company will have every business unit using data and A.I. in a strategic way. Obviously, we're not there yet on the planet. We're going through the maturity life cycles. But things that make it simpler and broaden the TAM for that, that's where we're investing our dollars. That's where we're going.

[00:28:24]

How do you think about your competitive advantage versus other data companies, with a TAM as big as this is already and is likely to become into the future? Big markets usually draw a lot of really smart, talented entrepreneurs and companies and use cases. Do you think about that much? Do you think about other companies going after a similar market? And if so, how do you structure your business to sort of have an accumulating advantage as you get bigger versus competitors?

[00:28:53]

Yeah, first thing is make sure that you have innovative DNA in the company that remains in there. So continuing to cannibalize yourself, continuing to figure out a better way to do things. Even if you face innovator's dilemma and hurt your existing revenue base, that's OK. Cannibalize it and come up with the next thing. That's why only a very small portion of what we do today is Spark. Today, it happens to be Delta. I'm sure tomorrow there'll be something else.

[00:29:18]

Building that innovation DNA into the company is essential, and enabling the company to take risks and be able to actually continue to innovate. That's essential to compete. And one thing that really helps us there is open source. We don't have a lock-in moat where we can just sit there and say, well, we now have everybody's data, we've locked it in and they can't leave. We have to continue innovating. That's one thing. But the second thing is the fact that we actually own the whole pipeline from the data coming in all the way to the use cases.

[00:29:48]

So it's sort of vertically integrated that way. That has helped us a lot because you get benefits when you're there all the way from the data ingest to the productionization. A lot of companies actually carve out just one sliver of that pipeline. We're not like that. And there's lots of room for optimizations that you can do if you own it end to end, which is what we have. That itself is also an advantage. Others could do that too. Other companies just haven't done it for some reason.

[00:30:14]

What do you think the most interesting other companies outside of Databricks are in the modern data stack? As you think about the growing market of people that have a lot of data, want to use that data to do something good or productive for their business? I don't think anyone would argue that that's happening or is going to happen a lot more in the future. What other types of companies or even individual companies do you think are really critical in moving the needle on what's possible right now?

[00:30:37]

I look at startups. It's typically Series A or B startups that are interesting. The big company names that everybody knows well, I find them pretty uninteresting. I mean, they've had some innovation that's typically 10 years old. They're milking it with the go-to-market machine that's broadening, getting that tech out to bigger and bigger markets, but the core innovations aren't that interesting to me. The really interesting ones you find in startups that are working on making machine learning really more productive or simpler to use, and how to do that in real time.

[00:31:04]

There's also a lot of interesting things happening on the visualization front. How do you build data apps that can actually have built-in visualizations that you can interact with? There's lots of startups in that space.

Say a bit more about that second category.

[00:31:16]

What do you mean by visualizations and why is that interesting? What might that enable?

[00:31:19]

You have data driven apps where you're actually interacting with the data in the UI. It's not classic BI. It's not just a normal application. There's this new category that you're seeing emerging in the field that I think is very interesting.

Is there an example of a company that people could go to just to understand kind of what you mean by this? I'm still struggling to understand the middle, interacting directly with data visualizations.

[00:31:43]

I mean, there are these technologies like Dash where you can actually build data driven applications, where your application interacts with the data and visualizes it, and it's much, much more rich in the types of things you can do than classic BI dashboards, which typically are histograms and drop-downs and various types of graphs where you just click here. You can actually leverage machine learning under the hood if you want it to be interesting.

[00:32:11]

What do you think is the biggest bottleneck right now in the technology or the development of computing, broadly speaking? So if distributed systems and compute and storage were like a huge breakthrough that got us through Moore's wall, what are the walls today?

[00:32:27]

What walls are we trying to break through?

[00:32:29]

The biggest wall is between data and A.I., and you see it both organizationally inside enterprises, but you also see it in the tech and just the preferences people have. So the teams that are responsible for managing all these massive datasets typically sit in IT. They have to make sure that the data is secure. They have to make sure that the data is reliable. They're going to own it for the next 10 years. It needs to be compliant. So they're very conservative in nature.

[00:32:53]

And then you have the folks that are in line of business that are close to the use cases. They're the ones that are coming up with these amazing use cases that are going to transform their business. But they can't actually succeed without the data. There are companies that focus on these users: data management companies that are awesome at data management, at processing your data, making it fast, reliable, governance and so on. But they have zero A.I. built into their technology.

[00:33:16]

And on the other hand, in the line of business, you have companies that are focusing on just A.I. and machine learning, but they don't actually have any of the governance, data management, reliability and security capabilities that you need. Even the preferences of the practitioners are different. Typically you use something like Java on the IT side. That's what they like. They want it to be production ready. On the other side, in the line of business, it's technologies like Python and so on.

[00:33:42]

So there is this big divide. And typically also these two teams don't even report up to the same person in the organization. So this is slowing companies down. And this is not how it was done in the former tech companies ten years ago. They were organized differently and they were using one tech stack for both. I think that's the biggest barrier that needs to be broken down to accelerate us getting value out of the data.

[00:34:02]

When we had a conversation with Jeff Lawson at Twilio, one of the things that stood out was how Twilio and other companies helped change developers from sort of this back of house function to a very front of house, incredibly important function in a business, where now developers are impossible to get. It's just a really in-demand talent. I'm curious what your thoughts are on whether or not the sort of data world is going through that same transition where you see more chief data officers, more attention given to this part of any given business and sort of what you see as the best practices, whether it's a Comcast or a Shell or whatever.

[00:34:39]

The examples are the traditional firms that are doing this well. What do those firms share in common from your perspective? Because I think that sort of serves as advice for those playing catch up that want to use their data productively.

[00:34:52]

If you look at it, there was a lot of just collecting data, and asking how much data do we have, and managing and sizing that up: one petabyte, two. But eventually the business started saying, what value are we getting out of this? Things have changed a lot. Now I see that in every enterprise I talk to, there is this awareness from the top that this is absolutely essential. We have to invest in this. We have to get our A.I. and data strategy right.

[00:35:15]

If we don't, just around the corner there might be a startup, probably out of Silicon Valley, that's completely tech driven. It has thousands of engineers and they're just going to disrupt our business. We're going to be put out of business just the way Uber did it with the cab medallions, just the way Airbnb did with the hotels, just the way Netflix did it with Blockbuster and so on and so forth. So we're next in line. We've got to do something.

[00:35:36]

I mean, there's this urgency. That's great. But a lot of them struggle on how to do it. The ones that do it well, I think, have a few ingredients and a few things in common. One is they try to consolidate it under one leader or at least have some kind of center of excellence where you can get all these folks working together. The wrong way to do it is to completely separate line of business and IT, where IT is responsible for the data and line of business is responsible for these A.I. projects. Those inevitably get stuck in politics between the two departments, which have different goals that the business is asking of them: security and reliability on the one hand, and business impact, business value, on the other.

[00:36:17]

So getting those two together. If you have a chief data officer, that's great. The title isn't important. If you have some org structure where they can actually work toward the same goal, that's really, really important. So we see that in all of the companies that are doing that well. And of course, the leader of that organization is an important person. So picking the right leader, a change agent that understands how to sort of bring the organization along, because these enterprises have 30, 40, 50 years of history.

[00:36:41]

So how you transform them and make them sort of adopt this new kind of data-native approach to things, that's going to be critical. Second, typically, leveraging open source to avoid lock-in. Most of the development in the space is happening in open source. There are new machine learning models every week released by the universities. So that's really, really critical. Three, building with A.I. in mind from the beginning. Don't just collect a lot of data and brag about how much data you've collected and figure it out later.

[00:37:09]

"Later we'll figure out these A.I. projects." You kind of have to do it from the get-go. And finally, to be able to move fast and be agile, bet on the cloud. Running your own on-prem data centers at this point, it's probably obvious, is going to be slow. You're going to be stuck with old equipment. I think most people realize that now with the pandemic. But more importantly, have a multi-cloud strategy, because there are right now three or four cloud vendors and they're all big companies with a lot of capital. They're not going to go away.

[00:37:37]

They have different strengths and weaknesses. Make sure that you don't put all your eggs in one basket and have a multi-cloud strategy. Those are some of the ingredients we see work well for the companies that are successful with that.

You have been the CEO of the business now for a while. How would you sum up the major lessons that you've taken away on what it means to be a good CEO? Extra points if it's lessons learned by doing something the wrong way and figuring out how to build a large and important company.

[00:38:02]

First of all, the company goes through different phases. So you need to be good at different things at the different phases. And actually sometimes the things that you have to be good at in one phase actually hurt you now. So you have to kind of transform at each phase. The first phase was really a product-market fit phase: understanding really what enterprises need and just making sure that you're building the right thing for them. So that's the first phase. The second phase was, OK, we figured it out.

[00:38:25]

This is what they need. How do we scale the machine? So bringing in the pros that really could scale the machinery. And the third phase is, I would say, an optimization phase where you now have so many people in the organization, with thousands of employees, that you have to make sure that all the processes are smooth and they just work. And you can just drop in new employees that come in every day and they'll become Bricksters, as we call them, and they'll do things the way Bricksters do things here.

[00:38:51]

They'll come up with new innovations and they'll do things the Databricks way. That requires much, much more process. If you actually compare the first phase and the third phase, things you need to be good at in the first phase actually hurt you at this phase. In the first phase, we said: be a co-founder, be an owner. If you see trash in the kitchen, even if you didn't put it there, you clean it up. It's your company.

[00:39:13]

You're an owner. Actually, at this phase, I would say don't pick it up. Let's go back and figure out how did it end up there in the first place. And let's put in processes to make sure that that never happens again. It's just a little bit of a different sort of mindset you have to have at scale, and different tradeoffs. For me, probably the biggest lesson is how important trust is, and building trust with leaders is nothing you can do overnight.

[00:39:36]

Ultimately, especially as the organization grows, it's about finding leaders that you can trust and work together with to drive the company, because you can't do this alone. You need other really strong leaders. That's probably the biggest lesson: how important that is, how time consuming that is and how difficult that is.

To go back all the way to where we started, the origins of all this, the environment of Berkeley, not just Berkeley, but of research centers in the US and around the world.

[00:40:02]

AMPLab specifically comes to mind. Maybe just say a few words about how that all works. And I'm sure most people that are listening haven't heard of AMPLab, for example. What kind of work happens there? Why is that such an important source of progress and innovation in the world of technology?

[00:40:19]

Berkeley had a special way of doing things. I would give a lot of the credit to one particular professor who's now retired. His name is Dave Patterson, who is also now a Turing Award winner, which is kind of the Nobel Prize for us in tech, the closest we can come to it. He had this mentality that we need to really work together and collaborate. As evidence of that, at some point he said, let's have all the students and professors sit together and actually have students from different backgrounds work together.

[00:40:47]

Administration said, no, we don't have room for that. There are not enough rooms. And he said, hey, so me and my fellow professors, we'll just give up our own offices. We don't need those. Let's give those up and build an open space where we're all working together. That has been sort of the successful formula that Berkeley has applied over and over. And many of those students have gone off and become professors in other schools. They're applying the same model.

[00:41:08]

It's a very collaborative model. Had that not existed, for instance, Spark would not have been developed, because it happened because of two people that were sitting next to each other: one was this machine learning mathematician and the other was a systems researcher. So they actually worked together. It's this interdisciplinary way of working collaboratively. And also it was in very close collaboration with industry, taking problems that exist in the industry and solving those in the research lab.

[00:41:36]

It's a very pragmatic, collaborative approach, both with industry and across researchers and professors of different seniority. I think that has had a huge impact on Berkeley and its research over the many past decades, but also on Databricks. Frankly speaking, one of our four cultural principles is teamwork makes the dream work. So it's a very highly teamwork oriented culture.

[00:41:56]

What are you most excited about in the future of technology? What are kernels of potential change and innovation that you think are most interesting today?

[00:42:04]

The most interesting thing is, I think, the merger of two markets, two platforms that are going to merge: technology for A.I. and data science is going to merge with technology for data warehousing and data management. We even call this the lakehouse paradigm because it's a portmanteau of two words. There are data lakes, which typically are used for A.I., and data warehousing, which is used for data management and BI. So data plus A.I.: lakehouse. I'm excited as that develops.

[00:42:33]

We've been saying it for many years, but in the last year or so, year and a half, we've heard a lot of other really large companies also get behind this and talk about the lakehouse pattern. So realizing that, because I think no one is fully there yet, including Databricks, is I think going to simplify things a lot for enterprises and get them closer to what the Silicon Valley tech companies had in the early 2000s. They were able to build it all from the ground up with the specific use cases of their business in mind, proprietary to their business.

[00:43:01]

And they had thousands and thousands of engineers. The rest of the enterprises don't have that. So building this and providing this to the enterprises, I think, actually will transform all these businesses over the next decade.

[00:43:12]

Well, I've really loved learning about your business. And I think for people that have heard that term, big data, maybe they hear it a little less than they did five years ago. But this puts so much context around how distributed systems made that possible after Moore's wall. I actually hadn't heard that phrase, Moore's wall. So it's great to learn about that. And just how Databricks as a company is playing into this ecosystem has been fascinating. I ask the same closing question to everyone that I interview and I'll ask you as well.

[00:43:36]

That question is to ask what the kindest thing that anyone's ever done for you is.

[00:43:40]

I think the kindest thing... I'm really thankful for all the health care workers. On the personal side, it's been a very difficult year for me and my family because my 12 month old son got diagnosed with cancer during COVID. We've been in the hospital. And the interesting thing is we actually were able to detect this cancer by screening him every three months because we had done genetic testing on him. And we had found the genetic marker that actually says that he's highly predisposed to this particular cancer.

[00:44:07]

This would not have been possible 10 years ago. So he would have probably had this and we would have found out much, much later. So I'm thankful to all the health care workers. I'm also thankful to technology that actually enabled us to spot this gene in advance and screen him every three months and then find the cancer so fast and hopefully make him live a long life. That's what I'm most grateful for.

It's an incredible answer.

[00:44:30]

I am thankful for you sharing with us. And obviously we'll be thinking about your son. We have learned so much from you today. Really appreciate the time and all the energy and what you're building. Thank you.

[00:44:39]

Thank you so much, Patrick.

This episode was brought to you by Klaviyo. In this four part mini series, I sit down with Klaviyo customer Nomad and discuss their origin story, why they chose Klaviyo for their business and how your brand can grow online sales with Klaviyo's e-commerce marketing platform. In this week's episode, Nomad marketing director Chuck Melber and I discuss how easy it is to get up, running and growing with Klaviyo's marketing platform. Plus, Chuck shares his advice for other marketers out there.

[00:45:07]

Chuck, I'm curious what it feels like if you were setting some new product up on Klaviyo. Say I'm some new e-commerce site, I launch a widget business tomorrow and I partner with Klaviyo. What would it feel like that first week setting it up? What process do I go through?

So their apps are pretty easy to get set up and automatically start sending data back and forth, which is the most important part of this marketing puzzle. After that, Klaviyo actually has a ton of built-out flows, or pre-populated flows, for you to start working with. The logic's already there.

[00:45:36]

The inspiration's there for you to work with. You can be a marketer selling product X, jump on Klaviyo, not really know much about browse or cart abandonment, and get it set up pretty easily. They have a nice WYSIWYG drag-and-drop editor. If you're a really good graphic designer you could of course design your own stuff and bring it in or do your own HTML. But I assume most people that are getting into the e-commerce space are probably someone like me who is more into the lazy-way editors, and theirs works quite nicely.

[00:46:02]

Makes it easy.

Across the time, Chuck, that you've used Klaviyo as a marketing person, what has it been like to work with them? What does it feel like to work with Klaviyo? In what ways do you tend to interact with the person or not? And just kind of a felt sense of being a Klaviyo customer?

Yeah.

[00:46:18]

So in my six years at Nomad, I've worked with a number of different companies. In my experience, Klaviyo's one of the better ones out there for sure, working with reps on a one to one basis, but then also looking at their documentation that's available. As you guys well know, documentation can get rather crazy and cumbersome very quickly. They've done a good job of distilling it to be super digestible for someone who's not super tech heavy. It does a great job of just making it easy for me to find my answer and then make my solution or fix the problem if I'm having one.

[00:46:48]

Chuck, any closing advice that you would have for other people? Obviously, e-commerce has exploded during 2020, and not just existing companies, but the number of new companies selling something online. Any advice you would give to especially the people responsible for marketing and getting the word out and building a thoughtful marketing organization, especially to new and young companies?

Experiment, test. Like, come up with a hypothesis and give it a shot. Don't be scared to try something out, even if it seems kind of wild, off the cuff.

[00:47:15]

We're still experimenting with that kind of stuff on a day to day basis. Email and social media are both platforms that are somewhat fleeting in nature. Like, you send out the letter, people see it or they don't. Because of that, you're able to experiment a lot and try out different things because it's not going to live on your website forever. It's not going to be something people see every single day. So if you have a harebrained marketing idea and you think it might work, give it a shot. You never know.

[00:47:40]

To find more episodes or sign up for our weekly summary, visit Investor Field Guide dot com. Thanks for listening to Founders' Field Guide.