Okay, so, I wanted to do something fun with data and baseball, because, you know, I’m a big fan. The idea popped into my head to mess around with machine learning and see if I could predict anything interesting about the MLB players in the Big Apple. It was kind of a vague idea, but I was excited to see where it would lead.
First off, I had to get my hands on some data. There are a bunch of places online where you can grab baseball stats, so I started digging. I spent a good chunk of time scraping data from different websites. It was a bit of a mess, honestly, jumping between sites, copying and pasting stuff, and trying to make sure everything was consistent. I kept at it until I felt I had enough.
After I managed to get a decent amount of data together, I needed to clean it up. This part is always tedious. I had to deal with missing values, inconsistencies, and all sorts of weird formatting issues. I used a bunch of Python libraries like Pandas and NumPy to help me out. I remember spending hours just staring at spreadsheets, trying to figure out the best way to organize everything.
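For a sense of what that kind of cleanup looks like in practice, here is a minimal pandas sketch. It is not the original script: the column names, the sample rows, and the `clean_stats` helper are all made up for illustration, and real scraped data would likely need more steps than this.

```python
import pandas as pd

# Made-up rows standing in for scraped stats. Names and numbers are placeholders,
# chosen only to show typical problems: stray spaces, mixed case, duplicates,
# currency formatting, and a missing-value marker.
raw = pd.DataFrame({
    "player": [" Player A", "Player B ", "Player B ", "Player C"],
    "team": ["NYY", "nym", "nym", "NYY"],
    "avg": [".285", "0.301", "0.301", "--"],
    "salary": ["$1,200,000", "950000", "950000", None],
})


def clean_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: normalize text, parse numbers, drop duplicate rows."""
    out = df.copy()
    out["player"] = out["player"].str.strip()   # remove stray whitespace
    out["team"] = out["team"].str.upper()       # consistent team codes
    # Anything that is not a number (like "--") becomes a missing value.
    out["avg"] = pd.to_numeric(out["avg"], errors="coerce")
    # Strip "$" and "," before parsing the salary as a number.
    out["salary"] = pd.to_numeric(
        out["salary"].str.replace(r"[$,]", "", regex=True), errors="coerce"
    )
    return out.drop_duplicates().reset_index(drop=True)


cleaned = clean_stats(raw)
print(cleaned)
```

Whether to drop, fill, or keep the remaining missing values depends on the question being asked, so the sketch leaves them in place.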
Then, once the data looked somewhat presentable, I started exploring it. This is where things got a bit more interesting. I made a bunch of charts and graphs to visualize the data. There were bar graphs showing the distribution of player positions, scatter plots comparing ages and salaries, and histograms showing the spread of batting averages, among other things. It was cool to see all the numbers come to life in a visual way.
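Charts like these take only a few lines with matplotlib. The sketch below is an assumption about tooling (the post doesn’t say which plotting library was used), and the numbers are randomly generated stand-ins, not real player data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values for illustration only; none of this is real MLB data.
rng = np.random.default_rng(0)
batting_avg = rng.normal(0.250, 0.030, size=200)
age = rng.integers(21, 40, size=200)          # ages 21 through 39
salary_musd = rng.uniform(0.7, 30.0, size=200)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: the spread of batting averages.
ax_hist.hist(batting_avg, bins=20)
ax_hist.set_xlabel("Batting average")
ax_hist.set_ylabel("Players")

# Scatter plot: age against salary.
ax_scatter.scatter(age, salary_musd, alpha=0.6)
ax_scatter.set_xlabel("Age")
ax_scatter.set_ylabel("Salary (millions USD)")

fig.tight_layout()
# fig.savefig("exploration.png")  # uncomment to write the figure to disk
```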
Once I felt more confident, I figured it was time to dive into the machine learning part. I decided to use some simple models to start with, just to get a feel for things. I played around with linear regression to see if I could predict player salaries based on their performance stats. I also tried out some classification models, like logistic regression and decision trees, to predict whether a player would be traded or not based on various factors. I didn’t really know what to expect, but it was worth a try.
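A first pass with those three model types might look something like the following. This is a sketch assuming scikit-learn, which the post doesn’t name, and it runs on synthetic data with an invented relationship between the features and the targets, so the scores it prints say nothing about real players.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the feature choice (age, on-base percentage) and
# the formulas below are assumptions made up for this example.
rng = np.random.default_rng(42)
n = 300
age = rng.integers(21, 40, size=n).astype(float)
obp = rng.normal(0.320, 0.035, size=n)
X = np.column_stack([age, obp])

salary = 0.4 * age + 40.0 * obp + rng.normal(0.0, 2.0, size=n)   # regression target
traded = (age + rng.normal(0.0, 4.0, size=n) > 31).astype(int)   # 0/1 class label

# Hold out a quarter of the rows so the models are scored on unseen data.
X_train, X_test, sal_train, sal_test, tr_train, tr_test = train_test_split(
    X, salary, traded, test_size=0.25, random_state=0
)

# Regression: predict salary from performance stats.
reg = LinearRegression().fit(X_train, sal_train)
r2 = reg.score(X_test, sal_test)

# Classification: predict traded vs. not traded.
logit = LogisticRegression(max_iter=1000).fit(X_train, tr_train)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, tr_train)
logit_acc = logit.score(X_test, tr_test)
tree_acc = tree.score(X_test, tr_test)

print(f"linear regression R^2: {r2:.2f}")
print(f"logistic regression accuracy: {logit_acc:.2f}")
print(f"decision tree accuracy: {tree_acc:.2f}")
```

Scoring on a held-out split matters here: a decision tree in particular can look great on the rows it was trained on and much worse on new ones.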
The results were, well, mixed. Some models performed okay, others not so much. It was a lot of trial and error, tweaking parameters, and trying out different models. I recall feeling pretty frustrated at times, especially when a model that I thought would work really well ended up being a total flop. But, you know, that’s just part of the process.
After a lot of experimenting, I did manage to get a few models that showed some promise. For instance, I found that a random forest classifier could predict whether a player would be traded with reasonable accuracy. I also found, though I’m less sure about this one, that a player’s age and on-base percentage could be used to predict their salary, although the relationship wasn’t super strong.
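One way to check what “reasonable accuracy” means for a random forest is to cross-validate it and compare against always guessing the more common class. The sketch below assumes scikit-learn and uses synthetic data with hypothetical features (`age`, `obp`, `games`), so it shows the procedure, not the result from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the features and the rule generating the
# "traded" label are invented for illustration.
rng = np.random.default_rng(7)
n = 400
feature_names = ["age", "obp", "games"]
age = rng.integers(21, 40, size=n).astype(float)
obp = rng.normal(0.320, 0.035, size=n)
games = rng.integers(20, 163, size=n).astype(float)
X = np.column_stack([age, obp, games])
traded = (age + rng.normal(0.0, 4.0, size=n) > 31).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)

# 5-fold cross-validation gives five accuracy scores on held-out folds.
scores = cross_val_score(forest, X, traded, cv=5, scoring="accuracy")

# Baseline: accuracy from always predicting the majority class.
baseline = max(traded.mean(), 1.0 - traded.mean())

# Fit on everything to see which features the forest leaned on.
forest.fit(X, traded)
importances = dict(zip(feature_names, forest.feature_importances_))

print(f"mean CV accuracy: {scores.mean():.2f} (baseline {baseline:.2f})")
print(importances)
```

If the cross-validated accuracy barely beats the baseline, the model hasn’t learned much, no matter how good the raw number looks.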
What I have learned
- First, I’ve learned that data cleaning is a pain, but it’s super important.
- Second, machine learning can be a wild ride. Sometimes it works, sometimes it doesn’t, and it takes a lot of patience to figure out what’s going on.
- Third, baseball data is fascinating, and there are tons of interesting questions you can explore with it.
I’m not sure if I’ll continue working on this particular project, but it was definitely a fun learning experience. I might try to apply some more advanced techniques, like neural networks, in the future. Or, maybe I’ll just move on to a different sport altogether. Who knows?

Anyway, that’s my little adventure with Big Apple MLB players and machine learning. It was a bit of a rollercoaster, but I enjoyed the ride. If you’re into baseball or data science, I’d definitely recommend giving something like this a try. Just be prepared for a bit of a bumpy ride!