---

cat or car?

created on 2024-08-06

tags: [project, coding, ml]
---

credit where it is due

cat vs dog machine learning project - basically my go-to tutorial


link to the project: here

demo pics at the end :)


warning: this is not a tutorial. i am sorry for the lack of explanation, but i will make real tutorials with good explanations at another time.

so...im gonna be taking a machine learning class soon (cs559).


i wanted to make something machine learning-related so that im prepared to take on the course material. usually, i tend to do this in order to get ahead (since im very bad at learning things in real time).
therefore, i proudly present...


"cat or car?"

the idea came from...memes on the internet (imma keep it real). however, i was able to put my machine learning concepts to the test. this blog will consist of a concept breakdown, a coding speedrun and...some other things.


concept breakdown

let's start from the beginning. what is machine learning?

machine learning is a subfield of artificial intelligence that focuses on using random numbers and mathematical functions to generate some kind of "bigger function". for example, let's say that we want to input an image and have our program tell us if it is a cat or a car. to do so, the image is fed into this "bigger function" - a black box that is unknown to the user. this black box - realistically - is just a bunch of random numbers and mathematical functions. that's all that most of these "machine learning models" are.

this project uses what's called a neural network - a subcategory of machine learning models. neural networks specifically use interconnected "nodes" to produce an output. an image is provided below.


image of a neural network concept diagram

a neural network concept diagram (source)


you can imagine that each circle is a node that holds a number, and each connection between nodes carries a weight. a column of circles is known as a layer, and outputs are calculated layer by layer. each layer's nodes have connections to the nodes of the next layer (in this diagram, those would be the lines). this layer-by-layer propagation is what's known as forward propagation. however, the brilliance of the neural network is the effect that the output has on the network. once an output is produced on the last layer, some math is done to "optimize" the random numbers in all of the layers. this is known as backward propagation. as more information is put through the neural network, the nodes will optimize themselves to become better at their job. cool, right?

now, lets get a bit more specific. in this project, i used a supervised learning model. this basically tells you the way that the backward propagation works in the model. "supervised" means that the neural network's backward propagation is based on known, labeled outputs. in this example, i gave the neural network some pictures of cats and cars, it tried to predict what each one was, and then i gave it the actual answer. from there, the neural network can take the difference between my answer and its answer and optimize itself accordingly. there are many different types of learning models, but they get more complicated...you can do your own research on that.

okay, here's one last specification - this project uses a convolutional neural network. this just means that the neural network uses convolutional layers, which are built for computer vision (2d/3d images or videos).

in summary, this project uses a supervised learning model on a convolutional neural network. lots of fancy words, but the meaning is not too complicated.

in this project's convolutional neural network, there are only two interesting types of layers that we need to know about: convolutional layers and max pooling layers. in order to explain these layers, i will assume that you have some knowledge of how images are stored in a computer (rgb colors, pixel values, etc.). in a convolutional layer, there exists a "kernel" - a small matrix of numbers. this kernel is "slid" across the image, and at each position it is multiplied element-wise with the pixels underneath and summed up. an image is provided below for an example. there are more complex types of convolution, but...that's too much to cover here.


gif of convolutional layer calculations

a convolutional layer (source)
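to make the "sliding" idea concrete, here is a tiny numpy sketch (the image and kernel values are made up, not from the project):

```python
# a toy convolution: slide a 2x2 kernel over a 4x4 "image"
import numpy as np

image = np.array([
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
])

# in a real network these kernel values are learned, not hand-picked
kernel = np.array([
    [1, 0],
    [0, -1],
])

kh, kw = kernel.shape
out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
feature_map = np.zeros((out_h, out_w))

# "slide" the kernel over the image, multiplying and summing at each position
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + kh, j:j + kw]
        feature_map[i, j] = np.sum(patch * kernel)

print(feature_map)  # a smaller image than the one that came in
```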


a max pooling layer essentially takes each chunk of pixels in an image and keeps only the maximum value. each maximum is then saved in a new, smaller feature map. an image is provided below for an example.


picture of max pooling layer

a max pooling layer (source)
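and here is the same kind of toy sketch for 2x2 max pooling (again, made-up numbers):

```python
# a toy 2x2 max pooling pass with stride 2
import numpy as np

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 2],
    [2, 0, 1, 3],
])

pool = 2
h, w = feature_map.shape
pooled = np.zeros((h // pool, w // pool))

# keep only the maximum value of each 2x2 chunk
for i in range(0, h, pool):
    for j in range(0, w, pool):
        pooled[i // pool, j // pool] = feature_map[i:i + pool, j:j + pool].max()

print(pooled)  # [[4. 2.]
               #  [2. 5.]]
```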


both of these layers specialize in feature extraction - essentially keeping what might be important for the network while dropping most of the unimportant details. both of these layers also spit out a smaller image than the one that came in.

the other layer type is the standard dense layer - the one with the nodes and weights. each node in layer (n + 1) will compute the result:


latex: z = v \cdot w + b

variables:


  • z is the resulting value for the node (before activation)
  • v is the vector of input values from the nodes in the last layer
  • w is the vector of weights connecting those nodes to this one
  • b is the "bias" - a constant offset that moves the function up or down in value



so far, all of the calculations in the dense layer have been linear. in order to make this a "neural network" (and not a linear regression model), we need to pass the result through a non-linear function. this is known as an activation function. in my code, i use two activation functions - the rectified linear unit (relu) function and the sigmoid function.


relu:


latex: f(x) = \max(0, x)

sigmoid:


latex: f(x) = \frac{1}{1+e^{-x}}
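to tie the dense-layer math and the activation functions together, here is a toy forward pass for a single node (all numbers here are made up):

```python
# one node: z = v . w + b, then an activation function
import numpy as np

def relu(x):
    return np.maximum(0, x)          # rectified linear unit

def sigmoid(x):
    return 1 / (1 + np.exp(-x))      # squashes any number into (0, 1)

v = np.array([0.5, -1.0, 2.0])       # values coming from the previous layer
w = np.array([0.8, 0.1, -0.4])       # weights (randomly initialized in practice)
b = 0.2                              # bias

z = np.dot(v, w) + b                 # the linear part
print(relu(z))                       # hidden layers use relu
print(sigmoid(z))                    # the last layer uses sigmoid -> the "cat or car" score
```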

this all culminates in one number: our output value. in my code, "0" means cat and "1" means car. of course, you can have a neural network with multiple outputs - you just have to know how to use them appropriately.

now...remember when i said that the neural network can optimize itself through backward propagation? yeah...that math is a little trickier.

lets start with the dense layer - the basic layer. first, we use a loss function - a function that measures how far the neural network's answer is from the expected answer. one example of a loss function is mean squared error, as shown in the equation below. y is the expected answer, and yhat is the neural network's answer.


latex: MSE_i=(y_i-\hat{y_i})^2

we then take the average of all of these losses for our training set (n items). let's call that C.


latex: C=\frac{1}{n}\sum_{i=1}^{n}{(y_i-\hat{y_i})^2}

note: this is not the loss function that i used. this will be briefly touched upon in the coding section.

using this average loss, we can use gradients to find the amount by which we should change our values. the basic objective here is to find the weights that minimize C - our loss. to do this, we express the change in C with respect to each weight using the chain rule.


latex: \frac{\partial C}{\partial w_i}=\frac{\partial C}{\partial \hat{y}}\cdot\frac{\partial \hat{y}}{\partial z}\cdot\frac{\partial z}{\partial w_i}

using multivariable calculus, you can work out each of the three partial derivatives (i am not going through it, but here is a cool guide on medium - it basically explains all of what i should be explaining). after getting these three gradients, you simply change the weights and biases by their respective gradients multiplied by a learning rate α. the reason we multiply the gradient by a learning rate is to make sure that we dont overstep the minimum that we want to find.
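to make this concrete, here is a toy single-weight gradient descent step using the mse loss and a sigmoid output from above. this is not the project's actual code - the project uses binary crossentropy and keras does all of this internally:

```python
# one gradient-descent step for a single weight (made-up numbers)
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

v, w, b = 1.5, 0.3, 0.1      # input, weight, bias
y = 1.0                      # expected answer ("1" = car)
alpha = 0.1                  # learning rate

# forward pass
z = v * w + b
y_hat = sigmoid(z)
loss = (y - y_hat) ** 2

# chain rule: dC/dw = dC/dy_hat * dy_hat/dz * dz/dw
dC_dyhat = -2 * (y - y_hat)
dyhat_dz = y_hat * (1 - y_hat)
dz_dw = v
grad_w = dC_dyhat * dyhat_dz * dz_dw

# step against the gradient, scaled by the learning rate
w = w - alpha * grad_w
print(loss, grad_w, w)
```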

now...for the convolutional and pooling layers, there is a back propagation function for each of these layers as well. however, i am not going to go through them as they require a vast amount of linear algebra knowledge (and ngl...im currently too lazy to type it all out with latex). if you are interested, here is an amazing article in which all of the desired material is covered.

now...let's get into the coding!


the coding

i used python (specifically keras) to build this model. the datasets were taken from the following sources:





i started by cleaning up these files. while there were many folders containing different breeds of cats and cars (for other purposes, perhaps), i only took the folders that said "cat" and "car" as i had no need to separate different types of each class. i put all the images into two folders: "data/cats" and "data/cars".

since the files were randomly named with a bunch of uppercase letters, i wanted to format them. i ran this block of code to change the file names to "cat_1", "cat_2", "cat_3", etc, etc...

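something along these lines (the folder names come from above; the rest is a sketch, not the exact block i ran):

```python
# rename every file to cat_1, cat_2, ... / car_1, car_2, ...
import os
from tqdm import tqdm

for folder, prefix in [("data/cats", "cat"), ("data/cars", "car")]:
    for i, filename in enumerate(tqdm(os.listdir(folder), desc=folder), start=1):
        ext = os.path.splitext(filename)[1]
        os.rename(
            os.path.join(folder, filename),
            os.path.join(folder, f"{prefix}_{i}{ext}"),
        )
```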

the tqdm library is not essential, but it gives you a cool loading bar, so...yeah.

next, i wanted to split my dataset into training and testing data. the training data will do all the fancy math that was mentioned earlier, while the testing data will measure how accurate the neural network is after training. the metrics of the model's test results will be plotted (as seen later).

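something like this, using the split-folders package (the 80/20 ratio and the seed here are placeholder choices, not necessarily what i used):

```python
# split "data/cats" and "data/cars" into train/val subfolders inside "aggregated_data"
import splitfolders

splitfolders.ratio("data", output="aggregated_data", seed=42, ratio=(0.8, 0.2))
```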

the splitfolders library made things pretty easy for me.

after this, i realized that some of the images were not openable (maybe they were corrupted, idk). to keep these from messing up the model, i wrote a block that validated all the images in the new folder, "aggregated_data".

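a sketch of that validation pass, assuming PIL is doing the checking (the actual block may have differed):

```python
# walk through aggregated_data and delete anything that can't be opened
import os
from PIL import Image

removed = 0
for root, _, files in os.walk("aggregated_data"):
    for name in files:
        path = os.path.join(root, name)
        try:
            with Image.open(path) as img:
                img.verify()  # raises an exception if the file is corrupted
        except Exception:
            os.remove(path)
            removed += 1

print(f"removed {removed} bad images")
```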

now, we get to the more fun part: the model itself!

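a hedged reconstruction of the model based on the description that follows: one vgg block (3x3 convolution + 2x2 max pooling), a flatten layer, dense layers, a sigmoid output and binary crossentropy. the filter counts, the dense layer size and the adam optimizer are my assumptions, and here keras's Rescaling layer only normalizes pixel values (the 32x32 resize happens when the dataset is loaded):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(32, 32, 3)),   # normalize pixels to [0, 1]
    layers.Conv2D(32, (3, 3), activation="relu",
                  kernel_initializer="he_uniform"),          # the sliding 3x3 kernel
    layers.MaxPooling2D((2, 2)),                             # keep the strongest features
    layers.Flatten(),                                        # 2d feature maps -> 1d vector
    layers.Dense(128, activation="relu", kernel_initializer="he_uniform"),
    layers.Dense(1, activation="sigmoid"),                   # one output: 0 = cat, 1 = car
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",                    # the loss discussed below
              metrics=["accuracy"])
model.summary()
```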

the model contains convolutional 2d layers (since we are dealing with 2d images) and max pooling layers. the flattening layer simply flattens the layer data (in this case, our feature-extracted image) into a 1d array. interpreting this code requires a little bit of knowledge of keras, but it should mostly make sense. the "kernel_initializer" determines how the kernel - the sliding matrix that multiplies along the image - gets its starting values. the rescaling layer scales the images down to 32x32. more on that later.

the loss function that i used - as mentioned earlier - was not mean squared error, since my neural network's output is binary (a single value squashed between 0 and 1). because of this, i used the binary crossentropy loss function. this loss function is a bit more complicated, but here is a beautiful mathematical analysis! this model architecture uses what's known as one vgg block (a 3x3 convolutional layer followed by 2x2 max pooling). here is an article with more information on that.

finally, i made the function that would train and test this model.

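a sketch of that training/testing function under the same assumptions (batch size, epoch count and the plotting details are guesses):

```python
import matplotlib.pyplot as plt
from tensorflow import keras

# build keras datasets from the folders created earlier; the 32x32 resize happens here
train_ds = keras.utils.image_dataset_from_directory(
    "aggregated_data/train", label_mode="binary", image_size=(32, 32), batch_size=64)
test_ds = keras.utils.image_dataset_from_directory(
    "aggregated_data/val", label_mode="binary", image_size=(32, 32), batch_size=64)

# fit() runs the forward and backward propagation for us
history = model.fit(train_ds, validation_data=test_ds, epochs=20)

# evaluate on the held-out data
loss, accuracy = model.evaluate(test_ds)
print(f"test loss: {loss:.4f}, test accuracy: {accuracy:.4f}")

# plot training (blue) vs testing (orange) curves
plt.subplot(2, 1, 1)
plt.title("cross entropy loss")
plt.plot(history.history["loss"], color="blue")
plt.plot(history.history["val_loss"], color="orange")
plt.subplot(2, 1, 2)
plt.title("classification accuracy")
plt.plot(history.history["accuracy"], color="blue")
plt.plot(history.history["val_accuracy"], color="orange")
plt.savefig("results.png")

# save the trained model (the .h5 later gets converted with tensorflowjs_converter)
model.save("cat_or_car.h5")
```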

in a nutshell, i took the image datasets that i had produced earlier and created a keras dataset. from there, i ran the "fit" function - it just automatically does the calculations for you. after it finished, i evaluated the model using the evaluate() function. the results are plotted and saved, along with the trained model. the json version of this model is the one that is currently deployed on my website.

that's it! i ran the model and got some results. now, let's talk about the results...

originally, i tried to rescale the images to 200px by 200px. however, upon running the model, the accuracy was not looking too hot...


cross-entropy loss and accuracy of 200px by 200px model

cross-entropy loss and accuracy of 200px by 200px model


(the blue line is training, and the orange line is testing)

after a lot of deliberation, i figured that it was because the image convolution layers were not working efficiently. thus, i changed the size of the images to 64px by 64px, and lo and behold...


cross-entropy loss and accuracy of 64px by 64px model

(64px by 64px) lookin a bit better!


you can actually see the cross entropy loss! (im not sure what happened with the first graph)

finally, i decided to go even smaller - 32px by 32px - just to see what would happen. the results were interesting...


cross-entropy loss and accuracy of 32px by 32px model

(32px by 32px) why the graph more erratic doh


interestingly, the testing loss and accuracy seem to fluctuate more near the beginning (implying a more volatile state), but the results are better than those of the 64px by 64px model. this version of the model is the one that is currently deployed.


technical difficulties

man, it was extremely annoying to get keras working on my computer. well, to be more specific...getting tensorflow to detect my gpu was hard. however, this was mainly because...i dont know how to read. i did not realize that tensorflow versions after 2.10 no longer support the gpu natively on windows.

on top of this, tensorflowjs_converter - the library that i used to convert my model from .h5 to .json - does not work on windows. i dont remember how i figured this out, but it definitely took at least half an hour to troubleshoot.


conclusions?

overall, this project took approximately 3 days to make. aside from the mnist database project, this was the first completed project that ive done with machine learning, so i had a lot of fun applying my knowledge!

right now, this model only chooses between cat or car, so if you put something that isnt either of those...itll take its best guess...
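for the curious, here is roughly how a single image gets classified with a model like the one above (the file path and the 0.5 cutoff are illustrative assumptions):

```python
# load the saved model and score one image
import numpy as np
from tensorflow import keras

model = keras.models.load_model("cat_or_car.h5")

img = keras.utils.load_img("some_image.jpg", target_size=(32, 32))  # hypothetical path
x = keras.utils.img_to_array(img)[np.newaxis, ...]                  # shape (1, 32, 32, 3)

score = float(model.predict(x)[0][0])  # single sigmoid output
print("car" if score >= 0.5 else "cat", score)
```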

once the code is a little more enriching and...not plagiarized(?)...i might push it to github.

im sorry if this blog is lacking some pieces here and there - i plan to do more technical blogs in the future and make them a bit more tutorial-oriented. i was too lazy to explain most of the math and coding that i did. however, future technical blogs should be much more detailed and complete.

speaking of technical blogs...


you have no idea how long it took me to write this blog.

in one of my next few blogs, i will go over the nightmare of trying to implement latex and code markdown in my portfolio (and all the new additions). until then, ill catch you guys in the next one o7 (please subscribe)

oh yeah, one more time: here's the link to the project!


project pics


model parsing a car image

car parsed successfully!



model parsing a cat image

cat parsed successfully! (fun fact: thats my cat)



model parsing an image that is not a car or a cat

cat...parsed successfully?