From 6bc13c852a251cc794b2bd60cd77463ad7a8c59d Mon Sep 17 00:00:00 2001
From: AdeB
Date: Mon, 13 Jul 2015 11:00:17 -0400
Subject: Alex's instructions to reproduce the results.
---
 README.md | 131 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 131 insertions(+)

diff --git a/README.md b/README.md
index 00eb60b..bab0109 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,134 @@

Winning entry to the Kaggle ECML/PKDD taxi destination prediction competition.
https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i


**Dependencies**

We used the following packages developed at the MILA lab:

* Theano, a general GPU-accelerated Python math library with an interface similar to numpy (see [3, 4]). http://deeplearning.net/software/theano/
* Blocks, a deep learning and neural network framework for Python, based on Theano. https://github.com/mila-udem/blocks
* Fuel, a data pipelining framework for Blocks. https://github.com/mila-udem/fuel

We also used the scikit-learn Python library for its mean-shift clustering algorithm. numpy, cPickle and h5py are used in various places as well.


**Structure**

Here is a brief description of the Python files in the archive:

* Configuration files for the different models we experimented with; one of them defines the model that gets the best solution.
* The data pipeline files:
    * general statistics about the data;
    * a script that converts the CSV data file into an HDF5 file usable directly by Fuel;
    * utility functions for exploiting the HDF5 file;
    * a script that initializes the HDF5 file for the validation set;
    * a script that generates a validation set using a list of time cuts (cut lists are stored in Python files; we used a single cut file);
    * the Fuel pipeline that transforms the training dataset into structures usable by our model.
* Scripts for various statistical analyses of the dataset, including the script used to generate the mean-shift clustering of the destination points, which produces the 3392 target points.
* Source code for the various models we tried:
    * code common to all the models, including the code for embedding the metadata;
    * code common to all MLP models;
    * the code for our MLP destination prediction model, which uses the target points for its output layer.
* The functions for calculating the error based on the haversine distance (a NumPy sketch of this distance is given at the end of this README).
* A Blocks extension for saving and reloading the model parameters, so that training can be interrupted.
* A Blocks extension that runs the model on the test set and produces an output CSV submission file.
* The main code for training and testing.


**How to reproduce the winning results?**

1. Set the data path environment variable to the folder containing the competition CSV files.
2. Run the CSV-to-HDF5 conversion script to generate the HDF5 file (it is written to the same folder, alongside the CSV files). This takes around 20 minutes on our machines.
3. Run the validation initialization script to initialize the validation set HDF5 file.
4. Run the validation cut script to generate the validation set. This can take a few minutes.
5. Run the arrival clustering script to generate the arrival point clustering. This can take a few minutes.
6. Create two folders next to the training script: one to receive regular saves of the model parameters, the other to receive the submission files generated from the model at regular intervals.
7. Run the training script to train the model. Output solutions are written to the submission folder created in step 6 every 1000 iterations. Training can be interrupted at any time with three consecutive Ctrl+C. The training script is set to stop after 10,000,000 iterations, but a result file produced after fewer than 2,000,000 iterations is already the winning solution. We trained our model on a GeForce GTX 680 card, and generating the winning solution took about an afternoon.

When running the training script, set the `THEANO_FLAGS` environment variable so that Theano exploits GPU parallelism.
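Theano reads its configuration from `THEANO_FLAGS`, e.g. `THEANO_FLAGS=floatX=float32,device=gpu python <training script>` from the shell. The values shown are illustrative 2015-era settings for single-GPU float32 training, not necessarily the exact flags of the winning run. A minimal Python equivalent (the variable must be set before Theano is first imported):

```python
import os

# Illustrative flags: float32 math on the GPU. Any valid Theano flag string
# works the same way; the exact flags used for the winning run are not
# listed in this README.
os.environ.setdefault("THEANO_FLAGS", "floatX=float32,device=gpu")

import theano  # Theano reads THEANO_FLAGS when it is first imported

print(theano.config.device, theano.config.floatX)  # e.g. "gpu float32"
```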
Theano is only compatible with CUDA, which requires an Nvidia GPU. Training on the CPU is also possible, but much slower.


More information can be found in this PDF: https://github.com/adbrebs/taxi/blob/master/doc/short_report.pdf
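For reference, the error measure mentioned in the Structure section is based on the haversine (great-circle) distance between the predicted and the true destination. Below is a minimal NumPy sketch, assuming the standard formula and the mean Earth radius; the repository's own implementation is not shown here and may differ (for instance, it could be written with Theano ops so that the distance can be used directly in the training cost).

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points, in kilometres.

    Inputs are latitudes/longitudes in degrees; NumPy broadcasting applies,
    so whole arrays of predicted/true destinations can be scored at once.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Example with approximate coordinates: central Porto to Porto airport,
# about 11 km apart.
print(haversine_km(41.1579, -8.6291, 41.2481, -8.6814))
```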