Velocity NY 2016 Session – Using Machine Learning to Determine Drivers of Bounce and Conversion

There has been a lot of historical work looking at the relationship between performance and conversions, but most of it has been after the fact or relied on linear models. By combining SOASTA’s experience and data from measuring performance for real users with Google’s experience with deep learning, we have been able to produce a much better model that can be applied to any site and help answer questions like:

  • How much faster do I need to make my site to see a measurable impact in conversions?
  • What would the ROI be for improving the site performance?
  • How much is slow performance costing my business?
  • At what point should I expect to see diminishing returns on improving performance?
Speakers: Tammy Everts, SOASTA | Patrick Meenan, Google




Tammy Everts: Don’t be alarmed by the fact that it says part 2; you don’t have to have read or seen part 1 to understand it. We did blog about it, there are posts on the SOASTA blog and the slides are on SlideShare somewhere. Really what this talk is about is, as Steve said, we started this project about a year ago and really started doing the analysis, because just managing the data and dealing with that is a huge undertaking. Anybody who has ever tried to manage a lot of data will understand and appreciate how hard that can be. We crunched it, did a lot of validation on it, and came to Velocity in Santa Clara with a handful of what we thought were interesting findings. Some of them controversial. We had questions about some of the findings ourselves; this is a process, and it’s kind of a never-ending process. Pat did the heavy lifting on going back into the data and filtering it some more. We’re going to talk about how we got from there to here. Does that pretty much sum it up?


Patrick Meenan: Yeah, I think that’s pretty good. Mic works. Wasn’t entirely sure about that.


Tammy Everts: If you have any questions you can Tweet at us, we’ll Tweet back. I’ll let you take over.


Patrick Meenan: I’ll just get this out of the way really quickly. Machine learning, as far as it goes for RUM, is very much a case of using a sledgehammer to crack a nut. Every time I’ve talked to anyone on the Brain team or anyone at Google who really does machine learning for a living, they laugh at me when I tell them we have 93 different inputs we’re looking at. They’re like, you could run that on your phone. I’ll be the first to admit I am not the best with analytics in general; I am not a statistician. Someone who is, a real data scientist, could probably find non-machine-learning ways to do a lot of the stuff that I’m doing here. I’m a code monkey, so I like throwing machines at the problem and letting the machines figure everything out for me.


  First off, everything we’re going to be talking about here is all open source. There’s not a whole lot of code, just to get that out of the way. Machine learning, for whatever reason, is all in Python; everyone in the machine learning community, all of the libraries and everything, is based around Python. When I say all of the code, the real work in machine learning is in preparing the data to get it into a shape where you can do the training. With all the machine learning libraries we have these days, the actual code for doing the machine learning is like 10 lines. I’ve got those 10 lines of code up on GitHub.


  Just some really quick basics and a little background so you know where we’re coming from. With machine learning there are a whole bunch of different approaches; we’re largely looking at two of them. There’s deep learning, which for the most part means neural nets, that kind of thing. Basically, you take all of the inputs on the left, the little circle dots, and the machines try to copy what the brain does with synapses and neurons. They go okay, we have these inputs, we connect them to everything else, and then we take all of those and connect them to all of the others until we have an output at the end. Those connections in between, we’re just going to guess like crazy and keep guessing until we find something that gives us the answer we want.


  That’s all machine learning really is: brute-force the crap out of it until you find the relationships that happen to give you the answer you want in the end. That’s why it’s embarrassingly parallel and runs really well on GPUs and that kind of thing, because those work really well on parallel problems. Throw all of the numbers in, it’s largely just multiplications as you go through, and did you get the answer you wanted or not.


  Random forest is the other machine learning algorithm we’re looking at. It’s also largely guessing, but instead of trying to map and mimic the neural nets of the brain, it’s basically a decision tree. What it’s doing is it’s randomly guessing at what the decisions are until you get an answer that you like. A random forest is basically throwing a whole ton of decision trees at the problem until they sort of all kind of agree and get you toward the answer that you wanted. In both cases, it’s basically just throwing a whole bunch of math and a whole bunch of cases at the problem to get you the answer.


  I mentioned that in machine learning the biggest part of preparation is getting the data into a format it can actually do something with. Phase 1 of that is largely turning everything into numbers. In the SOASTA data set we’ve got things like: these are Samsung devices, this is the user agent string, these are mobile devices. Those aren’t numbers, so one of the things you have to do is turn all of those string values into numbers. Largely, everything needs to be a 1 or a 0, or an input in a range between 0 and 1. What we end up doing is a pass that goes through and goes okay, is it an Apple device, set this one bit here, yes or no? Is it a Samsung device, yes or no? You have to be really careful. The UA string in particular is one we left out because of the UA string explosion problem, thank you Microsoft, putting every version of .NET in the UA string.


  Since these turn into a histogram, every distinct value of one of your strings ends up being a different input, so 93 data points can end up being like 10 million different data points if you’re not careful. Luckily we have code that boils user agent strings down into the things we care about, platform, device, that kind of stuff, and we take those as inputs so we don’t have that same problem. All of that code is up there too; it’s more than the 10 lines on the machine learning side of things. It’s not really all that exciting.
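
  As a rough illustration of that encoding step, here is a minimal Python sketch using pandas; the column names and values are hypothetical, not the actual mPulse beacon fields.

```python
import pandas as pd

# Hypothetical session rows; the string columns are the kind of fields the talk
# describes (device manufacturer, connection type), not the real beacon schema.
sessions = pd.DataFrame({
    "device_manufacturer": ["Apple", "Samsung", "Apple", "Other"],
    "connection_type": ["wifi", "4g", "wifi", "3g"],
    "dom_ready_ms": [1200, 3400, 900, 5100],
})

# Turn each distinct string value into its own 0/1 column (the "histogram" style
# encoding described above), leaving the numeric columns alone.
encoded = pd.get_dummies(sessions, columns=["device_manufacturer", "connection_type"])
print(encoded.columns.tolist())
```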


  The other thing you have to do is balance the data. We were talking earlier about another case where you can use machine learning to do anomaly detection and things like that. In this data set, for example, there was a 3% conversion rate overall. If we just threw the data as it was at the machine learning, it would be really easy for the machine to always guess no, it didn’t convert, and all of a sudden we have a model that’s 97% accurate just by always guessing no. You want to be really careful and watch out for that and, as close as possible, try to get a 50/50 data set on your inputs so it’s really learning what drives the decisions.


  This is another case where having a lot of data really helps. Thank you SOASTA for all of the data. When you only have a 3% conversion rate, you pull out the 3% that are yes and then only a random 3% sample of the no’s, so you’re now doing the training on 6% of the data. Having lots and lots of data really helps in that kind of case.
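
  A minimal sketch of that balancing step, assuming the features and labels are already NumPy arrays; this simply downsamples the majority class, which is one common way to get to the roughly 50/50 split described above.

```python
import numpy as np

def balance_classes(X, y, seed=0):
    """Downsample the majority class so converted/non-converted rows are ~50/50."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]                       # e.g. the ~3% of sessions that converted
    neg = np.where(y == 0)[0]
    neg_sample = rng.choice(neg, size=len(pos), replace=False)  # same number of negatives
    keep = np.concatenate([pos, neg_sample])
    rng.shuffle(keep)
    return X[keep], y[keep]
```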


  Finally, the data needs to be as close to normally distributed as possible for the machine learning to do its thing. Luckily you don’t have to know how to do that; there are Python libraries that do it for you as part of scikit-learn, so it’s like 2 lines of code. One important thing here, and we’ll talk about this in a second: when we take the training set and do the fitting, up at the top we have scaler = StandardScaler(), so we get a variable that maintains the mapping used to convert the data set into a normal distribution. We need to hang on to that for later, because when you actually feed data into the trained model, you need to re-transform that data as well to make sure it’s on the same distribution. So make sure you store the transforms you applied to your data as part of the training.
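
  A minimal sketch of that step with scikit-learn’s StandardScaler, which standardizes each column (what the talk loosely describes as getting the data onto a normal distribution); the matrices here are placeholders for the prepared session data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
import joblib

# Tiny placeholder training matrix; in practice this is the prepared 80% training split.
X_train = np.array([[1200.0, 15], [3400.0, 42], [900.0, 8], [5100.0, 120]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn per-column mean/variance on training data only

# Hang on to the fitted scaler: anything you later feed to the trained model
# has to go through the exact same transform.
joblib.dump(scaler, "scaler.pkl")

X_new = np.array([[2000.0, 30]])                 # e.g. a new session to score later
X_new_scaled = joblib.load("scaler.pkl").transform(X_new)
```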


  Finally, it’s really easy for machines to learn patterns, so you want to hold back some of your data and not train on all of it. We usually use 80% of the data to let it train and figure out what the patterns are. Then we take 20% of the data it’s never seen before, throw that at it, and see how accurately it predicts. That way you can make sure it’s really learning the relationships and not just memorizing the patterns inside your data set.
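
  The 80/20 hold-out described above is a one-liner with scikit-learn; the feature matrix and labels below are random placeholders just to make the sketch self-contained.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and 0/1 labels standing in for the prepared, balanced session data.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# Train on 80%; hold back 20% the model has never seen to check that it learned
# real relationships rather than memorizing the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```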


  Finally, at least on the training side of things, you want to be really careful, and it helps to know the data set you’re feeding it, the relationships, and what all of the metrics really mean in the real world. When we first started out, we threw all of the fields in and said hey, let the machine sort it out. We ended up in a case where the number one correlation with conversions was SSL use on the session. Once we figured it out it’s like, well duh, checkout pages use SSL, so yeah, SSL’s going to be highly correlated with checkouts. You want to watch out for giving it inputs that are really more like your outputs than things you’re trying to measure and train against. We pulled SSL on the session out of the data set and fed everything back in. We removed a couple of others as well. For bounce, the overall session length was highly correlated with not bouncing; once you think about it, that’s kind of a no-brainer.


  I’ll jump through this really quickly because, especially in the back of the room, this is probably an eye chart. It’s 1 line of code to train a random forest. One line of code with 20 different parameters, I’d almost think it was Perl code, but for the most part you just give it your data set with a bunch of tuning options and say, okay, go fit and tell me when you’re done with the training.
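
  A sketch of that one line of fitting, continuing from the placeholder split above; the particular tuning values here are illustrative guesses, not the settings used in the talk.

```python
from sklearn.ensemble import RandomForestClassifier

# The "1 line with a bunch of tuning options": construct the forest, then go fit.
clf = RandomForestClassifier(n_estimators=500, min_samples_leaf=5,
                             n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)

# Check against the 20% hold-out the model has never seen.
print("held-out accuracy:", clf.score(X_test, y_test))
```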


  When we did part 1 back in Santa Clara, we could only get useful … What we really want to know is which of these metrics drive conversions or bounce. For us, that’s kind of the money result: out of all of these 93 metrics, which ones do we care about? Which ones do we want to look at and focus on? When we first went back and talked to the deep learning guys, they were like, well, we told you how to do the deep learning for prediction; you want to understand what it’s doing? No. You can’t do that with deep learning. On the first pass we ended up relying on random forest, which does let you get some level of insight into feature importance. Luckily it’s also just 1 line of code. Actually it’s not even 1 line of code, it’s a variable that’s already stored for you. Training the deep learning is 2 lines of code: you have to compile the model and send the data to it.
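
  A sketch of both pieces, reusing the placeholder data above: the random forest importances really are just a stored attribute, and with Keras (assumed here; the talk doesn’t name the library or describe the actual network) training boils down to compile plus fit.

```python
from tensorflow import keras

# Random forest: the per-feature importances are a variable already stored on the trained model.
feature_names = [f"feature_{i}" for i in range(X_train.shape[1])]
for name, score in sorted(zip(feature_names, clf.feature_importances_),
                          key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.3f}")

# Deep learning: a small, hypothetical Keras network, then the two lines described in the talk.
model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=256, validation_split=0.1)
```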


  I mentioned deep learning is very much a black box. You give it a whole lot of inputs in your training, it figures out what the relationships are, magic is applied, and on the other side you get your output, your result. We did figure out a way, given enough time, to get it to give us what we wanted. Lots of machines and brute force. What we ended up doing is we took each one of the 93 inputs and trained 93 models separately, each with a single input, to see how accurately each one came out as a prediction. Once we had the number 1 feature that drove the most accurate prediction, we tried training 92 models with that feature plus each of the other features. Lather, rinse, repeat, keep going. Depending on how many features end up driving what you’re looking for, you could end up training tens of thousands of combinations to see what works. Luckily, I don’t think any of them needed to go beyond 2 or 3 features combined before we hit diminishing returns on the training. In both cases we ended up with: these are the 3 features that really are involved in deciding whether the session is going to bounce or convert.
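
  A sketch of that brute-force forward search. To keep it runnable, a quick cross-validated classifier stands in for retraining the deep model on each feature subset, which is what the talk actually describes; the structure of the search is the same.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def score_feature_subset(X, y, columns):
    """Train a quick model on just these columns and return its accuracy.
    (Stand-in for retraining the deep model on each subset, as in the talk.)"""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, columns], y, cv=3).mean()

def greedy_feature_search(X, y, feature_names, max_features=3):
    """Forward selection: at each pass, keep whichever next feature predicts best."""
    chosen = []
    for _ in range(max_features):
        best = None
        for i, name in enumerate(feature_names):
            if i in chosen:
                continue
            acc = score_feature_subset(X, y, chosen + [i])
            if best is None or acc > best[1]:
                best = (i, acc)
        chosen.append(best[0])
        print(f"added {feature_names[best[0]]} -> accuracy {best[1]:.3f}")
    return [feature_names[i] for i in chosen]
```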


  Then one of the really cool things we could do once we had the deep learning model: we have a trained model that can predict, based on 2 or 3 inputs, whether a session is going to bounce or convert, and it’ll tell you the probability. We could take a synthetic set of data; in this case I took every 100 millisecond interval from 100 milliseconds to 20 seconds for each of the inputs that ended up driving the result. Put that into the prediction model, have it spit out the probability of conversion or bounce across that whole range, and plot it out, and all of a sudden we can see what the impact of varying each one of those inputs is on bounce or conversion. I’ll let Tammy give you what you actually came here for, which is the resulting data.
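
  A sketch of that synthetic sweep, assuming a trained model and fitted scaler that take a single timing input (in practice the talk’s model takes 2 or 3 inputs, so you would sweep a grid instead of a line).

```python
import numpy as np

# Sweep one timing input from 100 ms to 20 s in 100 ms steps, run each value through
# the trained model, and read off the predicted bounce/conversion probability.
times_ms = np.arange(100, 20001, 100, dtype=float).reshape(-1, 1)
probabilities = model.predict(scaler.transform(times_ms)).ravel()  # same transform as training

# Print a few sample points; in practice you would plot the whole curve.
for t, p in zip(times_ms.ravel()[::50], probabilities[::50]):
    print(f"{t:>7.0f} ms -> {p:.1%} predicted probability")
```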


Tammy Everts: A huge caveat before I proceed, because I don’t think I made it clear enough when we presented this in Santa Clara: this is what we learned using our massive set of aggregated data from a lot of different retail sites. For instance, when we presented this a few months ago, one of the things we talked about was the fact that Start Render didn’t seem to be a meaningful predictor of conversions. Everybody was like, oh my God, what, Start Render? I measure that, that’s really important to me. What we found here, and everything I’m about to tell you, is just what we learned from our data. The whole point of this exercise, and the reason we’re walking through how we did it, is to show you how it’s done: how we learned, how we went back and learned some more, in this cyclical way, so you can see that this is something you can do yourself, following the same process of questioning assumptions. Our actual findings you shouldn’t take to the bank and say these are the metrics that matter to us, because one thing I know from having done a lot of research across a lot of different verticals, and a lot of different companies within those verticals, is that everybody is different. No unicorn metrics here, sorry.


Patrick Meenan: That’s probably one of the reasons we have the code up on GitHub and not a pre-trained model or anything like that: run the code, or an equivalent of the code, on your own data set and see what it means for your business and your metrics. SOASTA has a lot of customers and a whole lot of data, but it’s kind of a smeared vision of things, and your specific case will definitely be different.


Tammy Everts: Just a little bit about what’s in our beacon. We use mPulse to gather all of our data and we keep all our data in perpetuity, which is why it’s so handy for studies like this. If people ever ask us why we keep all our data, this is one good reason: if Google comes to you and says hey, could we use your data for a cool study, you can say yeah, and do that. As Pat mentioned, we looked at 93 different features that the mPulse beacon collects. The beacon’s actually built on Boomerang. I don’t know if Phillip is still here … There he is in the back. He’s the guy responsible for Boomerang and [inaudible 00:16:22] open source project. This is just a handful of some of the things that we capture. If you want to learn more, you can follow that link and the entire list will be there.


Patrick Meenan: All the slides will be posted with the presentation for downloading as well. You don’t have to write down the link.


Tammy Everts: We’ll post them this afternoon. They’ll be there and we’ll post the link on Twitter. What we found, I’ll just give you a little teaser. It was really exciting because I know Pat doesn’t like to give the same talk over and over again, so I was the one who suggested that we talk here; we hadn’t even done the first talk yet and I was already trying to get you to commit to the second one. Fortunately for Pat, and I think for the sake of this kind of research in general, we found enough things that were the same as our first pass that we weren’t going oh my God, we’re idiots. We found enough things that were different, but where we know why they were different, that it’s interesting and relevant, and Pat is here and actually wants to talk about it. And then there were a couple of new things that just weren’t there the first round, things we didn’t notice the first time.


  When we presented in Santa Clara, one of the things we talked about was that when we looked at all 93 metrics, they all seemed to matter. They all affected conversions in some way, shape, or form. Here, what was really interesting was we found that not everything was necessarily a great predictor of bounce rates.


Patrick Meenan: I was just looking at the scale. It’s important when you look at some of these scales and data: remember I said if you guess no all of the time on a 3% data set, you’d be 97% accurate just by always guessing no. 50% is the baseline of just tossing a coin, so 50% is really the zero level for how accurately any one of these metrics drives anything. With these really tall green lines, and I think this is from when we had 2 metrics so the baselines are around 70, most of that green bar doesn’t add any value; it’s really only the top edge of the green line that does.


Tammy Everts: Does that make sense? I’ve highlighted the area: almost a third of the metrics just weren’t great predictors of the 2 things we looked at, which were bounce rate and conversions. The reason we looked at bounce rate and conversion is that they’re pretty meaningful metrics for most people. Regardless of what kind of site you have, you probably care about people staying on it, and then obviously for retailers, conversions really matter. As you can see, for the ones that really matter, that tiny little [inaudible 00:19:24] off to the left of conversions, when we looked at the first 2 of those we found that the ones most meaningful for bounce rate were DOM ready, you know it as DOM Content Loaded, and average session load time. Here we saw that combining both of these metrics, as Pat said before, doing that first pass and then the second pass where we looked at DOM ready combined with load time, got us up to 89.5% accuracy. You mentioned this in Santa Clara: what were the percentages where we considered it actually relevant?


Patrick Meenan: Anything over 50 is an improvement. 90% is really good in general. To be able to do that with only 2 metrics was actually … I expected to have the machines running a whole lot longer than we did and to need to go to something like 5 metrics to get this high. I was really impressed. The machine learning guys were very, very happy to see prediction rates in the 90% range.


Tammy Everts: That’s what I recall from Santa Clara. The thing that’s interesting, I don’t know if you can see this at all, but the blue line is DOM ready and the red line is page load. When you look at them, you see the curve just goes up and to the right in a clean way; we like that. At around 1 second is when you start to hit the climb, and at around 4 and a half, 4.8 seconds, is where the probability increases even further. What’s meaningful about this is that as page load slows down, it has a higher impact on bounce. The slower the pages get, the more bounce increases.


Patrick Meenan: What I found really interesting about this is that you really need to be under 4 seconds for DOM Content Loaded, and I think it was around 6 seconds for the page load time. That’s still not good, but you’re not yet in the plateau where you’ve already lost them anyway. And as you get faster, there’s no point where you’re done. It’s almost a straight line down and to the left: the faster you get, the lower your bounce rate. Actually, once you get down below a second or so, any incremental improvement you make is an even more significant improvement in your bounce rate. Granted, I know there are points of diminishing ability to improve those performance numbers, where the engineering effort is maybe not worth the improvement you get, but at least from what the machines are telling us on the predictions, this is the probability of bounce at any one of those times, and pretty much the faster you can go, the lower it gets. It’s almost a straight linear relationship.


Tammy Everts: The thing that’s interesting to me about this particular graph is how it [inaudible 00:22:54] other graphs from similar research I’ve done, where I’ve used the phrase performance poverty line: the point at which your conversion gets so low, or your bounce rate gets so high, that it just plateaus. It doesn’t matter much whether your page takes 6 seconds or 15 seconds to render. It still matters obviously, you don’t want your pages to render in 20 seconds, please, but it’s when you get below that 5 or 6 second mark that you really see greater returns on your investment in performance.


  This was really interesting too. We found that while we got almost 90% predictability on the bounce data, it was a lot tougher to get high predictability on the conversion data. With this set of numbers, we had to make 3 passes and combine 3 different metrics to get something that added up to something meaningful. Here you can see the same idea: things get meaningful at the top of the red bar. We got a high of 81% prediction accuracy, which is still good, but not as awesome as the 90%. We were less enthused about this one.


Patrick Meenan: That’s why I’m more excited about the bounce data, both because it’s all about performance and because it’s really highly predictive. Bounce pretty much applies to every industry and it’s really easy. If you’re a publisher, bounce makes a lot of sense; if you’re eCommerce, bounce also makes a whole lot of sense.


Tammy Everts: I’m not surprised that this was a little bit harder to call, because one of the things that’s challenging on the retail side is that it’s all apples and oranges, right? This retailer’s conversion rate is very different from that retailer’s conversion rate, even how they define conversion, so it’s tougher to measure, and anybody in this room who works in retail will probably say the same thing. 81% was pretty good, and I’ll go into which metrics were specifically used. The one that really dominated was finding number 4: pages with more scripts were less likely to convert. If you were in Santa Clara, on the first pass over the data we found that pages with more scripts were more likely to convert, but then with the filtering that we did, we filtered out basically some of the things that, I think, correct me if I’m wrong, the machine learning had kind of learned about the wrong-


Patrick Meenan: While we were looking at the random forest, we didn’t really have more or less, we couldn’t plot out the prediction rate, so we were just going oh, okay, more scripts. We didn’t have the data to tell us which way and in which direction it mattered.


Tammy Everts: What this looks like is really interesting. You can see that the X-axis here is the number of scripts. I don’t know if you can read this in the back.


Patrick Meenan: The fact that it goes into three digits should probably scare the crap out of you. Just in general.


Tammy Everts: Not a significant number of them do.


Patrick Meenan: It goes up into four digits at the end there. These are script tags not necessarily external scripts, but that’s still a whole lot of script tags on pages.


Tammy Everts: You can see that before around 300 scripts, it’s possible the machine learning learned some of the patterns of what checkout flows look like. That was what you said in terms of explaining the randomness. After around 300, it evens out and gets into a pretty nice predictive model, where you can see that as more and more scripts are on the page, it hurts conversion.


Patrick Meenan: This one for me is, as long as you don’t go nuts with your scripts it’s fine, but if you’ve got more than 400, 500, 600 scripts on your page …


Tammy Everts: Don’t be on the right half of this graph if you can help it. The other thing we found was that the number of DOM elements mattered a lot. Here’s what that looks like. You can see that your sweet spot is between 400 and 700, then it plateaus out. After 1,000, but really after about 2,300, it just plummets. That’s the painful area right there. We’re looking at pages with 9,700, close to 10,000, DOM elements, so same idea: you don’t want to be on the right half of that graph, or really anywhere in this red zone if you can help it.


Patrick Meenan: Both of them it’s basically, try and keep complexity sane. It’s not even minimize it, once we get up into some of these numbers, you’ve gone off the deep end anyway.


Tammy Everts: The other thing that was consistent with what we found in the first pass was that mobile-related measurements weren’t really meaningful predictors of conversion. This is really interesting to me. We looked at, for example, all 93 different page metrics: median bandwidth is around the middle, device type and mobile connection type are down there at the bottom. This is for bounce; I could show you a very similar one for conversions, it was the same thing. They were actually all a little bit lower down, median bandwidth was even lower on the list. I guess the takeaway there, and we talked about this a little a few months ago, is that it’s not the mobile web anymore, it’s just the web. People expect experiences to be speedy; people bounce at pretty much the same rate regardless of whether they’re on desktop or mobile. You can’t really segregate out user behavior and think that people behave this way on desktop and this way on mobile. They just behave the way they behave, across the board.


  This was kind of the biggie, because this freaked some people out. In Santa Clara we said that Start Render was 69th out of 93, but when we went back into the data and filtered some things out, we realized the issue was that our beacon gathers Start Render data but it’s not supported by all browsers. When we filtered out the browsers that didn’t support it and just looked at that data, it was something like 280,000?


Patrick Meenan: Yeah, Safari and Firefox, neither one gives you Start Render, so if half of your browsers aren’t supporting the metric, it’s largely not useful as a predictor. This is another case where it really helps to dig into the data, know the data, and know what all of the metrics are. You don’t want to just blindly throw it over to an analytics team and say go figure stuff out with this data. When you see something like this, Start Render not being important, it just floored me when we were at Santa Clara. I was like okay, I’m quitting this industry, I’m going home; all the work I’ve done on WebPageTest around visual metrics, no. That’s probably what drove me to actually dig into each one of the records, look at the data, and say what? Why? When you get something that ends up being counterintuitive, definitely dig into it and see if you can figure out why. It’s not unusual for your source data to not be quite as clean as you would hope, especially in a RUM situation where you’ve got stuff coming in from the field. You get all sorts of crazy things in there.


Tammy Everts: You raise a really good point. This is the value of having a cross-disciplinary team. You need people who understand the metrics you’re capturing, people who understand the business metrics as well, and people who understand how the browsers work. You have to have people coming into it with their own areas of expertise and actually speaking to each other so you can filter these things out. As Pat said, if you see weird things, you need to be able to dig in and find out why they’re weird. We actually found that Start Render shot right up to the top of the list.


Patrick Meenan: Close to the top. It’s important to note that DOM Content Loaded still ended up being a more important driver than Start Render, even when we had both in the same data set, which is good since DOM Content Loaded is available in all browsers. If you think about it, it’s not terribly surprising because it tends to include a little more of the user experience. It also includes when the browser becomes more interactive; you’ve generally hooked up most of your JS to the UI by then and things like that. DOM Content Loaded being more of a predictor than Start Render is just because it’s a slightly more inclusive metric about the user experience. It helps that it actually makes sense and I can rationalize what came out of the model.


Tammy Everts: This reflects my world view.


Patrick Meenan: I feel good.


Tammy Everts: All right, so I’m going to pass it over to you.


Patrick Meenan: We talked a little bit about why SSL correlated and why we had to pull some things out. One of the other things you want to watch out for when you’re training the deep model is whether it’s just learning the patterns of the conversion flow, and this is one of the reasons I’m also not as excited about the conversion results as the bounce results. The maximum server response time ended up being the number 1 predictor of conversion, and it’s basically just telling us that checkout flows are slow; the slowest step in almost any flow like that is going to be the credit card validation. One of the things you can do to mitigate some of this, and we just haven’t had a chance to do it yet, is look at a data set with just the initial landing page the user came in on, so you have a clean read, and whether the session ended in a conversion. It won’t necessarily give you as much value about the whole pipeline, but at least you have a clean read where the checkout pages aren’t included in your training set. Maybe there will be a part 3, we’ll see.


  The other thing, especially when you’re doing the brute-force machine learning iteration: the first metric that came out was the median DOM Content Loaded time, and on the second pass the top metric for predicting bounce was the max DOM ready time. It’s pretty easy, once you think about it, to figure out that if the median and the max are different, there were 2 page views, so it didn’t bounce. You have to mentally check that all of the metrics make sense. When we were doing this, any future pass that built on previous metrics would not include a metric related to one of the earlier ones; in the second pass, anything that was DOM content ready or DOM Content Loaded would just get X-ed out and wouldn’t count, and we could look at the other metrics. Watch out for the relationships between the individual metrics and the machine just learning that relationship rather than the actual behavior. Back to Tammy.


Tammy Everts: We sort of covered this earlier: use your own data, do this at home. Somebody asked me a question earlier today, when I was talking about this a little bit, that I’d never really thought to answer because it seemed so obvious to me: why would I do this, other than, well, you have a lot of data, it’s there like Everest, you climb it because it’s there. Two big reasons, which we talked about earlier, Pat. One, if you’ve got this beacon data and you’re gathering data around all of these metrics, you can figure out which metrics you can actually get people on your team to care about, so that everybody’s running toward the same finish line. Two, the predictive aspect: we’re heading toward the finish line on these two metrics, what difference will it make? It’s not just correlative, it’s predictive. If we make our pages 1 second faster, 500 milliseconds faster, whatever, we’ll probably see this return in terms of bounce rate and conversion. It gives you and your team some certainty about what you want to do.


Patrick Meenan: And for setting the goals. If you want to say, okay, we want to improve the bounce rate by 50%, where do we need to move the performance needle to get there?


Tammy Everts: Then get your own data. If you’re collecting RUM data and you don’t know what to do with it, somebody made the point yesterday that people are sitting on all their data and only use 5% of it; this is how you can use that other 95% of your data. Pat’s giving you the code to do it. Then, if you get unexpected results as we did, keep digging. Ask different questions, run different filters, make sure that you understand all of the different metrics and have a 360 view of every single variable you’re looking at, and just keep asking questions.


Patrick Meenan: You don’t have to write your own and collect your own RUM data. If you’re using analytics, mPulse, just about any of them will provide a way for you to get at the raw data so you can do this kind of thing on it.


Tammy Everts: Exactly. Thanks and please go online and rate this talk. We would really like to get your feedback. Thanks. We’ve got a couple of minutes for questions. There’s a break after this, right, Steve? If people have more questions we can hang around a little bit. Great.


Patrick Meenan: Jim.


Jim: [inaudible 00:37:34] a lot of metrics [inaudible 00:37:40] you can have a lot of scripts to work at.


Patrick Meenan: Did we look at the correlation between the metrics themselves? Not a whole lot. Although I will say the charts we showed were somewhat simplified. We had DOM Content Loaded as 1 line and page load time as a separate line; those are as if we looked at them independently. With the deep model, to get to the 90% we actually had 2 inputs. It’s a single model with 2 inputs, not 2 models with 1 input each. You then actually end up with a 3D plot of both variables at the same time, because they interact with each other. I didn’t have time, and I don’t think it would graph that well on the projector, but if you’re really doing it you want to be able to go okay, we need to lower DOM Content Loaded by 3 seconds and page load time by 2 seconds to get to whatever this bounce rate is. Really it’s the combination of both of those. We didn’t look at the correlation between every one of the metrics, though.


Audience: [inaudible 00:38:48].


Patrick Meenan: Have we applied this to Google search page?


Audience: [inaudible 00:38:57]


Patrick Meenan: Not explicitly. The search team already has a lot of metrics, and I think we’ve already talked at previous Velocities about every hundred milliseconds improving searches by X%. Largely they don’t need to be convinced to improve performance; they track it like you would not believe. They watch the business metrics as it improves. Their goals are faster, always faster; they don’t actually have an explicit target. I will say, we’ll probably look at doing this. I know DoubleClick did a study, they haven’t run the machine learning on it yet; they did a study with analytics that they published a few days ago, maybe, that had largely similar correlations. They had data scientists look at it using analytic techniques. We’ll hopefully be able to do similar things with the machine learning and get stuff out of there as well. But no, we haven’t run it on the Google search data itself.


Tammy Everts: This isn’t the question that you asked, but [inaudible 00:40:01] the fact that we’re working on machine learning at SOASTA. There’s a whole data analytics layer that sits above mPulse, so that’s something we’re actually working on, and there are elements of it in our latest release, which just went live this week.


Patrick Meenan: I should say, anything we learned on Google search would probably be useless for everyone else. Not the least of which is Google search’s entire goal is to bounce as quickly as possible which is kind of the opposite of what everyone else on the planet wants.


Audience: [inaudible 00:40:39].


Patrick Meenan: This is all retail industry. Well, it’s a combination of retail and publishers. It was a sampling of all of your data.


Tammy Everts: Yes.


Patrick Meenan: Heavily skewed toward retail.


Tammy Everts: A lot of our customers are retail customers.


Audience: [inaudible 00:40:55].


Patrick Meenan: Did we look at subsets? We threw all of the data at it. We did have the vertical and that kind of thing as separate inputs; they ended up not being something the machines picked out as important. But we didn’t explicitly carve down and look at individual retailers or subsets like that.


Audience: [inaudible 00:41:22].


Patrick Meenan: The point was that, if you’re looking at conversions on retailers, even if the performance is horrible, if it’s an incredible deal the conversions are still going to be really high. Yeah, I mean, there are probably a lot of business things around it: the user really needed socks so they bought socks, or they didn’t need socks, so the conversion stuff is definitely not … That’s one of the reasons I like the bounce stuff better; it’s easier to rationalize how much they’re using your site, whereas you can’t necessarily go to Amazon and say hey, increase all your prices, as long as you’re fast they’re going to convert, right? There’s a whole lot of other things that drive conversion.


Tammy Everts: This is the value of throwing a lot of data at it. There are always going to be [inaudible 00:42:30] cases like that, where people come to the site because you’re the only retailer that sells that product, or you’ve got some amazing flash sale or something like that, but if you have enough data over time you can find patterns that supersede that kind of use case.


Patrick Meenan: If you run this on your own data set, you can say, okay, let’s look at the performance of all my flash deal pages, and then see what impact the performance any individual user saw had on their ability to actually get through the flow, or whether they just had 5 minutes on the train where they were browsing around and buying stuff. If they had a limited time window to do things, did any of that drive the conversions or not?


Tammy Everts: We have a lot of retail customers at SOASTA, and I won’t name any names, but there are customers that have had flash sales or campaigns directing people toward landing pages and then realized there were performance issues. It didn’t matter how great the sale was, people still bounced. People still didn’t convert as much as they wanted them to. The deal is a factor, but performance still matters even in those situations.


Audience: [inaudible 00:43:48].


Patrick Meenan: Split on the data, desktop versus mobile, was it US-focused? We had an even mix of mobile and desktop data in the data set. We didn’t actually split it when we did any training; we just had is-it-mobile-or-desktop as a metric or flag. I think the data set is skewed toward the US; there are some non-US retailers and things, so there’s definitely non-US traffic even though they’re largely US customers.


Audience: [inaudible 00:44:26].


Patrick Meenan: You should probably dig into the data even when you get results you expect. That way you can be scientific about it. That one drove me particularly nuts which is why I spent the extra time looking into it. I looked at a whole bunch of things even when we did get the results we kind of thought we’d see.


Audience: [inaudible 00:44:52].


Patrick Meenan: I’d say review the metrics, but also just look at the raw data, sample through it and see. I looked at what the distribution of the data was, what the range on the inputs was, whether I’d expect that to be the right distribution for these metrics, whether it looks like the device type is missing on half of the pages, for example, that kind of thing. Just do a sanity check against the actual data itself to make sure you really got what you expected before you run the training.


Audience: [inaudible 00:45:28].


Patrick Meenan: Can I ballpark the approximate number of person hours that were spent on this effort? More than I would have liked, because I was also figuring out how to do the machine learning.


Tammy Everts: That’s the thing, right? The curve is pretty steep in the beginning.


Patrick Meenan: It’s certainly in the hundreds of man hours but not thousands. It was just me and a few other guys. A few other guys who actually knew what they were doing got me started, and then it was me just making a few changes and running the models. Running the models, you just let the machines run for a few hours, so I won’t count that as man hours. I wasn’t actually watching; well, I was watching them, because I liked watching the numbers and going, okay, how is it improving. Probably on the order of 100 or 200 man hours total.


Audience: [inaudible 00:46:28].


Patrick Meenan: Any chance on releasing the anonymized source data? Unlikely.


Tammy Everts: Probably not, no.


Patrick Meenan: Sorry, but you have your own data, for your own sites. Certainly if you rerun some of this kind of stuff on your own data, publish, talk about it, let us know. It’d be nice to get sort of contrary opinions or validation that yes, I found that it held true with our data. Especially if you run it, target an improvement, make the improvement, and see the results, did the result match the prediction?


Tammy Everts: At Soasta we’re really lucky in that we have so many customers who have contractually agreed to let us do this kind of research on their data as long as it’s anonymized. I can’t thank them personally obviously because that kind of ruins the point of it, but they’re really great and I’m very grateful to them and if you’re one of them sitting out there then thank you very much.


Patrick Meenan: All right, thanks. Enjoy your break.


Tammy Everts: Thank you very much everyone.