Case Study: Predicting Ford Prices with Previse

Posted on: July 7th, 2022 | Posted by: Landan G




If you're looking to buy or sell a car, you probably want to know the vehicle's fair market value. The seller wants the highest possible price and the buyer wants the lowest, but ultimately both want a fair price, which typically falls somewhere in the middle.

This case study estimates the fair market value of Ford automobiles, from coupes to trucks, using some of the features buyers typically look for when shopping for a vehicle. The data is publicly available from Kaggle. It isn't the data our Previse pricing engine uses, but it's a good public dataset that anyone can work with.



Data Exploration


To start, let's look at the data we're working with. Our data has fourteen fields: thirteen that we'll use as inputs and one that we'll be predicting. The target is the price of the automobile. The inputs are the region the vehicle is in, the year, the make (in this case, always Ford), the model, the condition, cylinder count, fuel type, odometer value, title status, transmission (auto, manual, other), drive (fwd, rwd, 4wd), type (truck, pickup, SUV, van, etc.), and the paint color. Once the data has been cleaned and processed, the first step of most data science and AI projects is to explore it and see if we can extract any valuable insights, so that's what we're going to do now.


Overview

  • We have 15,740 distinct listings, all Ford models
  • Our dataset has eleven categorical fields and three numeric fields
  • Our data is spread across 190 unique regions/cities. The most common are:
    • Albany - 224
    • Minneapolis / St. Paul - 216
    • Jacksonville - 216
    • Columbus - 213
    • Madison - 207
    • Nashville - 200
  • The most common years in our dataset are:
    • 2013 - 1347
    • 2014 - 1186
    • 2016 - 1041
    • 2017 - 1037
    • 2015 - 1037
  • The most common models in our dataset are:
    • F-150 - 1557
    • Mustang - 482
    • Escape - 458
    • Explorer - 407
    • Focus - 372
  • We also include the condition of the vehicle. The categories and counts are:
    • Good - 6817
    • Excellent - 6633
    • Like New - 1630
    • Fair - 554
    • New - 71
    • Salvage - 35
  • Most of the vehicles are powered by gas; however, we have some other categories in our data:
    • Gas - 12907
    • Diesel - 2180
    • Other - 523 (Either unknown or not listed here)
    • Hybrid - 121
    • Electric - 9
  • The overwhelming majority of listings have a clean title, while a few don't:
    • Clean - 15131
    • Rebuilt - 313
    • Salvage - 144
    • Lien - 121
    • Missing - 27
    • Parts only - 4
  • The drivetrains are fairly evenly distributed compared to the above categories:
    • 4WD - 6460
    • RWD - 5063
    • FWD - 2934
    • Unknown - 1283 (Perhaps AWD or unknown?)
  • Color is an interesting category: most listings are standard colors, but the dataset includes a wide variety, such as:
    • White - 5722
    • Black - 2461
    • Red - 1846
    • Blue - 1558
    • Silver - 1462
    • Grey - 1161
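
Counts like the ones above come from a simple frequency tally per field. The following is a minimal sketch, not the code used for this case study: the field names and sample rows are hypothetical stand-ins for the Kaggle columns. The second half uses the fuel counts listed above to check that the categories add up to the 15,740 listings.

```python
from collections import Counter

# Hypothetical sample rows; the real dataset has 15,740 listings and 14 fields.
listings = [
    {"model": "f-150", "fuel": "gas", "drive": "4wd"},
    {"model": "f-150", "fuel": "diesel", "drive": "4wd"},
    {"model": "mustang", "fuel": "gas", "drive": "rwd"},
    {"model": "escape", "fuel": "gas", "drive": None},
]

def top_values(rows, field, n=5, missing_label="unknown"):
    """Return the n most common values of `field`, grouping missing values under one label."""
    counts = Counter(row.get(field) or missing_label for row in rows)
    return counts.most_common(n)

for field in ("model", "fuel", "drive"):
    print(field, top_values(listings, field))

# Sanity check with the fuel counts from the overview: they should cover every listing.
fuel_counts = {"gas": 12907, "diesel": 2180, "other": 523, "hybrid": 121, "electric": 9}
total_listings = sum(fuel_counts.values())
print("fuel total:", total_listings)
```

The same check works for the condition, title, and drivetrain counts, which is a quick way to catch rows that were dropped or double counted during cleaning.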

You can see some of the correlations between fields in our dataset in the images below:




Bringing Machine Learning to our Data


To keep things relatively simple for this case study, we're not going to dive into the code itself. Instead, I'll talk about the model we used, the metrics we achieved, and the results of some of our tests. We don't mind sharing these ML details, since our proprietary Bizatta Previse pricing engine uses more advanced machine learning that takes in more data and produces more accurate and complete estimates. We're releasing the ML insights shown here to give a high-level understanding of how we work.

On the above dataset, we trained a regression model to predict vehicle price. In fact, we created over 200 different models using XGBoost, LightGBM, random forest regressors, SGD regression, and decision tree regressors. By far, the best-performing models used XGBoost. Our primary evaluation metric was R-squared (r2 for short), which measures how much of the variation in price the model explains. As a rule of thumb, a score above 0.5 is considered relatively good, a score above 0.7 indicates a strong fit, and a score above 0.9 is exceptional and is often expected in advanced scientific studies. In short, the higher the r2 score, the better. We reached a test r2 score of 0.87, a strong result considering we used a basic dataset and a general-purpose ML model. To be more specific, our r2 scores are:

  • test_r2_score: 0.868
  • training_r2_score: 0.935
  • val_r2_score: 0.861
Overall, our XGBoost model performs well on held-out data, and while there is room for improvement, we'd rather rely on our Previse pricing engine for advanced price predictions.
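
For readers unfamiliar with the metric, R-squared can be computed directly from actual and predicted prices. The sketch below is illustrative only: the price lists are hypothetical, and the `r2_score` helper is our own small function rather than the code used to train the models. It also computes the gap between the training and test scores reported above, which is a rough way to gauge overfitting.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1.0 - ss_res / ss_tot

# Hypothetical prices in USD, for illustration only (not from the dataset).
actual = [10_000.0, 20_000.0, 30_000.0, 40_000.0]
predicted = [12_000.0, 19_000.0, 29_000.0, 42_000.0]
print(f"toy R-squared: {r2_score(actual, predicted):.3f}")

# Scores reported above; the train/test gap is a rough overfitting check.
reported = {"train": 0.935, "val": 0.861, "test": 0.868}
gap = reported["train"] - reported["test"]
print(f"train - test gap: {gap:.3f}")
```

A perfect model scores 1.0, and a model that always predicts the average price scores 0.0. The training score sitting somewhat above the validation and test scores suggests mild overfitting, which is common for tree-based models.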


Predicting Prices


To start, we'll predict the price of a vehicle whose features closely match the data our model was trained on, changing only the odometer reading to see how that affects the price. Our first prediction is for a 2022 Ford Transit-250 van. For reference, with 14,000 miles the price is ~$35,000. We ran our prediction engine on the same van with 100,000 miles and got an estimated price of $31,671, which makes sense since we know the odometer reading has a small but noticeable impact on price. Running the same prediction with 1,000 miles gives $38,843.47, which again makes sense given the difference in mileage. It's also worth exploring whether mileage affects the price less after a certain number of miles.
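
One rough way to look at the drop-off question is to compute the average price change per mile between the figures quoted above. This is a back-of-the-envelope sketch, not output from the pricing engine: the helper name is our own, and the 14,000-mile point is the approximate reference price rather than a model prediction.

```python
# (odometer miles, price in USD) for the 2022 Transit-250 van discussed above.
# The 1,000- and 100,000-mile prices are the model's predictions; the
# 14,000-mile figure is the approximate reference price (~$35,000).
points = [(1_000, 38_843.47), (14_000, 35_000.00), (100_000, 31_671.00)]

def price_change_per_mile(p_start, p_end):
    """Average price change in USD per mile between two (miles, price) points."""
    (m1, price1), (m2, price2) = p_start, p_end
    return (price2 - price1) / (m2 - m1)

# Slope over each consecutive segment, then over the whole range.
segments = list(zip(points, points[1:]))
segment_slopes = [price_change_per_mile(a, b) for a, b in segments]
overall_slope = price_change_per_mile(points[0], points[-1])

for (a, b), slope in zip(segments, segment_slopes):
    print(f"{a[0]:>7,} -> {b[0]:>7,} miles: {slope:.3f} USD/mile")
print(f"overall: {overall_slope:.3f} USD/mile")
```

On these three points, the early miles cost noticeably more per mile than the later ones, which is consistent with a drop-off. Three points, one of them approximate, are too few to confirm it, so a proper check would sweep the odometer value across many predictions.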

Our second prediction is for a Ford F-150. What happens to the price of a truck with the same specs and location when only the odometer reading and the model year change? Specifically, what if a nearly identical truck has double the miles on it but is two full years newer? If a 2018 F-150 with 50,000 miles sold for $35,395, what would a 2020 F-150 with 100,000 miles sell for? Our pricing model gave an estimate of $34,086, almost the same as the older but less-driven truck. Why does this make sense? The two biggest factors in the price are the odometer reading and the year, so perhaps, to someone buying a truck, being two years newer is roughly canceled out by an extra 50,000 miles.
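
To put "almost the same" into numbers, the two F-150 figures can be compared directly. This is simple arithmetic on the prices quoted above; the dictionary layout and variable names are just for illustration.

```python
# Figures quoted above for two otherwise identical F-150s.
older = {"year": 2018, "miles": 50_000, "price": 35_395.0}   # sale price
newer = {"year": 2020, "miles": 100_000, "price": 34_086.0}  # model estimate

price_gap = older["price"] - newer["price"]
gap_pct = price_gap / older["price"] * 100
extra_miles = newer["miles"] - older["miles"]
extra_years = newer["year"] - older["year"]

print(f"Price gap: ${price_gap:,.0f} ({gap_pct:.1f}% of the older truck's price)")
print(f"Trade-off: {extra_years} model years newer vs {extra_miles:,} more miles")
```

The gap works out to a few percent of the older truck's price, so in this example the newer model year and the extra mileage nearly offset each other.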



Conclusion


The Bizatta Previse pricing engine takes into account more inputs (variables) than most other pricing estimators, and our data comes from sources and data providers across the globe, including manufacturer data, third-party marketplace data, weather/climate data, general market data, consumer insight data, and more. With our Previse engine, we typically reach accuracy rates of 90% to 98%, helping us provide you with the most accurate results possible.

So why use Bizatta Previse for your business? We can tell you whether your vehicles are listed too high or too low and whether you should adjust them. We also offer advanced insights such as price history, how the day of the week affects sale price, how temperature and snow or rain affect sale price, and much more, helping you get the most money possible for your business while selling inventory faster than your competitors.