A Neural Network Classifier with Keras and Doc2Vec

In the previous two articles, Comparing Similar Video Games and Creating the Map of Video Games, I created a doc2vec model and visualized it. In this final article, I will use a dense neural network to build a genre classifier for the games.

I will combine the model from the first article with the clusters from the second article to predict genres for new games.


To create my neural network I will be using Keras. It will be a simple dense neural network with two hidden layers.

from keras.layers import Input, Dense
from keras.models import Model
from keras.callbacks import ModelCheckpoint

class ClassifierCategorical():
    def __init__(self, doc_vecs, tag_vecs):
        #This saves the best model seen so far after each epoch
        callbacks = [
            ModelCheckpoint('classifier_categorical.h5', save_best_only=True)
        ]

        inputlen = len(doc_vecs[0])
        outputlen = len(tag_vecs[0])

        #Create a model using the Keras functional API
        inputs = Input(shape=(inputlen,))
        x = Dense(inputlen, activation='relu')(inputs)
        x = Dense(inputlen, activation='relu')(x)
        outputs = Dense(outputlen, activation='sigmoid')(x)

        #Compile the model
        self.model = Model(inputs=inputs, outputs=outputs)
        self.model.compile(optimizer='adam',
                           loss='categorical_crossentropy',
                           metrics=['accuracy'])

        #Train the model immediately, holding out part of
        #the data for validation (val_loss / val_acc below)
        self.model.fit(doc_vecs, tag_vecs, epochs=25,
                       validation_split=0.1, callbacks=callbacks)

    def save(self, name):
        self.model.save(name)

You can see that the things that we need now are doc_vecs and tag_vecs. These are the inputs and outputs, respectively.

The inputs are going to be the docvecs from the doc2vec model I trained with gensim.

The outputs are going to be the genre a game falls under in the Map of Video Games. This data is in the form of a CSV file that looks like this:

name             clustertag        steamtag
Counter-Strike   FPS               Action
Half-Life 2      FPS               FPS
Fallout 3        Post-apocalyptic  Open World
...              ...               ...

If you would like to see how this data was created, please see those articles: Comparing Similar Video Games and Creating the Map of Video Games.


The docvecs are already 300-dimensional vectors, so the input needs very little preprocessing.

However, the output is a string, which we need to somehow convert into an array of the form [0, 0, ..., 0, 1, ..., 0, 0]. I will use scikit-learn's OneHotEncoder to do this.

from sklearn.preprocessing import OneHotEncoder

#inputs: create docvecs list that matches with the clustertag genre list
def get_docvecs_list(df, docvecs):
    """
    df: pandas.DataFrame
    docvecs: gensim.models.keyedvectors.Doc2VecKeyedVectors

    returns a list of numpy arrays
    """
    return [docvecs[name] for name in df['name']]

#outputs: encode categories
def encode_data(df):
    """
    df: pandas.DataFrame

    returns the one-hot encoded category array and the fitted encoder
    """
    vals = df['clustertag'].values
    vals = vals.reshape(len(vals), 1)
    enc = OneHotEncoder(sparse=False, categories='auto')
    enc_vals = enc.fit_transform(vals)

    return enc_vals, enc
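To make the encoding concrete, here is what OneHotEncoder does to a small, made-up genre column (the values are hypothetical; I call .toarray() instead of passing sparse=False so the snippet works across scikit-learn versions):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

#a toy genre column shaped like the clustertag values
genres = np.array(['FPS', 'Sports', 'FPS', 'Strategy']).reshape(-1, 1)

enc = OneHotEncoder(categories='auto')
one_hot = enc.fit_transform(genres).toarray()

print(enc.categories_[0])  #categories in sorted order
print(one_hot[0])          #'FPS' maps to the first position
```

Each row becomes an array with a single 1 in the position of its category, which is exactly the [0, 0, ..., 1, ..., 0] form the network's output layer expects.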

Finally, we can train the neural network.

import pandas as pd
import numpy as np
import gensim

if __name__ == '__main__':
    #load data
    model = gensim.models.Doc2Vec.load("model/doc2vec.model")
    df = pd.read_csv('clustercat.csv', index_col=0)

    #get inputs and outputs
    outs, enc = encode_data(df)
    ins = np.asarray(get_docvecs_list(df, model.docvecs))

    #shuffle the arrays for a random order, reusing the same
    #RNG state so inputs and outputs stay paired
    random_state = np.random.get_state()
    np.random.shuffle(ins)
    np.random.set_state(random_state)
    np.random.shuffle(outs)

    #train and save the model
    classifier = ClassifierCategorical(ins, outs)
    classifier.save('classifier_categorical.h5')
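Shuffling two parallel arrays without breaking their pairing can be done by saving and restoring NumPy's global RNG state, so both shuffles apply the identical permutation. A quick standalone check of the trick:

```python
import numpy as np

a = np.arange(10)
b = a * 100  #paired with a: b[i] == a[i] * 100

state = np.random.get_state()
np.random.shuffle(a)
np.random.set_state(state)  #replay the exact same permutation
np.random.shuffle(b)

#the pairing survives the shuffle
print(all(b[i] == a[i] * 100 for i in range(10)))  # -> True
```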

The full source code can be found here.


Epoch 2/25
17693/17693 [==============================] - 8s 446us/step - 
loss: 1.3323 - acc: 0.6761 - val_loss: 1.2404 - val_acc: 0.6822
Epoch 25/25
17693/17693 [==============================] - 7s 386us/step - 
loss: 0.2990 - acc: 0.9180 - val_loss: 0.9375 - val_acc: 0.7538

Note that I am using a GTX 1060, so training only took a few minutes. I chose to train for only 25 epochs to limit overfitting, which is already visible here: training accuracy is just over 90%, but validation accuracy is only about 75%.
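Another common guard against overfitting is early stopping, which halts training once validation loss stops improving. Keras provides this as keras.callbacks.EarlyStopping; the underlying rule is simple enough to sketch in a few lines (the loss values below are made up):

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training would stop: the first
    epoch where val_loss has not improved for `patience` epochs
    in a row, or None if training runs to completion."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

#made-up validation losses: best at epoch 2, then climbing
print(early_stop_epoch([1.24, 1.02, 0.94, 0.95, 0.97, 1.01]))  # -> 5
```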

With some tweaks to the dataset, I believe these results can be improved. For example, generic genres such as 'Free to Play' and 'Multiplayer' were left unfiltered when building the Map of Video Games; I think those can be left out.

Additionally, there are "duplicates" in the tag list, such as 'FPS' and 'Shooter', which could be consolidated or removed.
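As a sketch of the kind of cleanup I have in mind (the rows are made up, and I assume the genre column is named 'clustertag' as in the table above), pandas makes both steps a one-liner each:

```python
import pandas as pd

#toy stand-in for the genre table (hypothetical rows)
df = pd.DataFrame({'name': ['Counter-Strike', 'Borderlands', 'Dota 2'],
                   'clustertag': ['FPS', 'Shooter', 'Free to Play']})

#drop overly generic genres
df = df[~df['clustertag'].isin(['Free to Play', 'Multiplayer'])]

#consolidate near-duplicate tags
df['clustertag'] = df['clustertag'].replace({'Shooter': 'FPS'})

print(df['clustertag'].tolist())  # -> ['FPS', 'FPS']
```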

For now, I will try the trained model on new video games not found in the original dataset. For these games, I chose console games and Steam games I saw on "New & Trending". The documents come from a variety of sources, such as Wikipedia, Steam, and user reviews from various sites. I also tested it with an IGN critic review. The documents I will be using can be seen here.

Using the Keras model is very easy:

#python file that just holds a dictionary called idocs
from ivecs import idocs
#reusing encode_data from above to get the sklearn encoder
from kerasclassifier import encode_data

from keras.models import load_model
import gensim
import pandas as pd
import numpy as np

if __name__ == '__main__':
    #load cluster genres to load into encode_data()
    df = pd.read_csv('clustercat.csv', index_col=0)
    _, enc = encode_data(df)

    #load gensim models to infer new documents
    g_model = gensim.models.Doc2Vec.load("model/doc2vec.model")

    #load keras model
    k_model = load_model('classifier_categorical.h5')

    #infer new vectors from each document,
    #resulting in a list of 300d vectors
    ivecs = []
    for name, doc in idocs.items():
        processed_doc = gensim.utils.simple_preprocess(doc)
        ivecs.append(g_model.infer_vector(processed_doc))

    #use keras model to predict the categories, 
    #input 300d vector, output category vector.
    catvecs = k_model.predict(np.asarray(ivecs))

    #use scikit-learn onehotencoder from above 
    #to get the corresponding category with the vector.
    print(list(zip(idocs.keys(), enc.inverse_transform(catvecs))))

How did it do?

('Super Mario Bros. 3', ['Platformer']),          #wikipedia Gameplay section
('Animal Crossing (Franchise)', ['Simulation']),  #wikipedia Gameplay section
('Wii Sports', ['Sports']),                       #wikipedia Gameplay section
('Eastshade', ['Walking Simulator']),             #steam desc & user reviews
('War of Omens Card Game', ['Strategy']),         #steam desc & user reviews
('Wargroove', ['Strategy']),                      #steam desc & user reviews
('Red Dead Redemption 2', ['Simulation']),        #giantbomb user review
('Kingdom Hearts III', ['JRPG']),                 #gamefaqs user review
('The Last of Us', ['Survival']),                 #IGN critic review
('Pong', ['Sports'])                              #wikipedia entire page

It did fairly well, considering that only Wikipedia pages and Steam user reviews were used to train the original doc2vec model.


After a few weeks, I've finally created a successful classifier for video games using Doc2Vec. The Map of Video Games was a fun detour, but I think for the next few articles, I'll focus on some other ideas I have unrelated to video games.

Thanks for reading.