From c5187418bc93c34e3fdce4fdc1a3b5316812b69a Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Fri, 10 Jul 2015 19:20:57 -0400
Subject: Add documentation

---
 doc/report.tm | 352 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 352 insertions(+)
 create mode 100644 doc/report.tm

diff --git a/doc/report.tm b/doc/report.tm
new file mode 100644
index 0000000..6165cd1
--- /dev/null
+++ b/doc/report.tm
@@ -0,0 +1,352 @@

Winner Team's Report

1. Summary

Our model is based on a multi-layer perceptron (MLP), a simple feed-forward neural network architecture. The MLP is trained by stochastic gradient descent (SGD) on the training trajectories. Its inputs are the first 5 and last 5 positions of the known part of the trajectory, together with embeddings of the context information (date, client and taxi identification). The embeddings are trained with SGD jointly with the MLP parameters. The MLP outputs probabilities for 3392 target points, and a weighted mean of these points gives a unique destination prediction. We did no ensembling and used no external data.

2. Feature Selection/Extraction

We ran a mean-shift algorithm on the destination points of all the training trajectories to extract 3392 classes for the destination point. These classes were used as a fixed output layer for the MLP architecture.

We used the embedding method common in neural language modelling approaches (see [1]) to take the meta-information into account in our model. The embeddings and their corresponding dimensionalities are listed in the table below.

[Table 1: Embeddings and corresponding dimensions used by the model]

The embeddings were initialized to random values and were then left to evolve freely with SGD along with the other model parameters.

The geographical input to the network is a centered and normalized version of the GPS data points.

We did no other preprocessing or feature selection.

3. Modelling Techniques and Training

Here is a brief description of the model we used:

- Input layer: the input of the MLP is the concatenation of the following:
  - the first five and last five points of the known part of the trajectory;
  - embeddings for all the metadata.
- Hidden layer: we use a single hidden layer of size 500 with a Rectified Linear Unit activation, i.e. ReLU(x) = max(0, x). See [2] for more information about ReLUs.
- Output layer: the output layer predicts a probability vector for the 3392 output classes obtained in the clustering preprocessing step. If p is the probability vector produced by the softmax output layer and c_i is the centroid of cluster i, our prediction is

      \hat{y} = \sum_i p_i c_i

  Since p sums to one, this is a valid point on the map. (A small code sketch of this computation is given after this list.)
- Cost: we train directly on an approximation of the mean Haversine distance between the prediction and the true destination.
- Optimization: we used a minibatch size of 200, and plain SGD with a learning rate of 0.01 and a momentum of 0.9.
- Validation set: to generate our validation set, we tried to create a set that looked like the test set. We generated "cuts" of the training set, i.e. we extracted all the taxi rides in progress at given times. The times we selected for our validation set are similar to those of the test set, only one year earlier.
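To make the output layer concrete, the following is a minimal NumPy sketch of the prediction step and of the Haversine distance used to score it. It is an illustration only, not the Theano/Blocks code from the archive: the variable names are invented for this example, and the actual training cost uses an approximation of the Haversine distance rather than the exact formula below.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = np.radians(a)
    lat2, lon2 = np.radians(b)
    h = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

def predict_destination(hidden, W_out, b_out, centroids):
    """Softmax over the 3392 clusters, then a weighted mean of their centroids.

    hidden    : (500,) activations of the ReLU hidden layer
    W_out     : (500, 3392) output weights; b_out : (3392,) biases
    centroids : (3392, 2) cluster centers (lat, lon) from the mean-shift step
    """
    logits = hidden @ W_out + b_out
    p = np.exp(logits - logits.max())
    p /= p.sum()              # softmax probabilities, sums to one
    return p @ centroids      # \hat{y} = sum_i p_i c_i, a valid point on the map

# Toy usage with random parameters, just to show the shapes involved
# (coordinates are scattered around Porto for illustration).
rng = np.random.default_rng(0)
hidden = rng.normal(size=500)
W_out, b_out = 0.01 * rng.normal(size=(500, 3392)), np.zeros(3392)
centroids = rng.normal(loc=[41.16, -8.62], scale=0.05, size=(3392, 2))
y_hat = predict_destination(hidden, W_out, b_out, centroids)
print(y_hat, haversine_km(y_hat, centroids[0]))
```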
4. Code Description

Here is a brief description of the Python code in the archive:

- Configuration files for the different models we have experimented with, including the configuration that produces the best (winning) solution.
- The data pipeline code:
  - a module with some general statistics about the data;
  - a script that converts the CSV data file into an HDF5 file usable directly by Fuel;
  - utility functions for exploiting the HDF5 file;
  - a script that initializes the HDF5 file for the validation set;
  - a script that generates a validation set from a list of time cuts; cut lists are stored in Python files (we used a single cut file);
  - the Fuel pipeline that transforms the training dataset into structures usable by our model.
- Scripts for various statistical analyses of the dataset, among which the script used to generate the mean-shift clustering of the destination points, producing the 3392 target points.
- The source code of the various models we tried:
  - code common to all the models, including the code for embedding the metadata;
  - code common to all MLP models;
  - the code of our MLP destination-prediction model that uses the target points as its output layer.
- The functions for computing the error based on the Haversine distance.
- A Blocks extension for saving and reloading the model parameters so that training can be interrupted.
- A Blocks extension that runs the model on the test set and produces an output CSV submission file.
- The main training and testing code.

The archive includes only the files listed above, which are strictly necessary for reproducing our results. More files, for the other models we have tried, are available in our GitHub repository.

5. Dependencies

We used the following packages developed at the MILA lab:

- Theano: a general GPU-accelerated Python math library with an interface similar to NumPy (see [3, 4]).
- Blocks: a deep-learning and neural-network framework for Python, based on Theano.
- Fuel: a data pipelining framework for Blocks.

We also used an external Python library for its mean-shift clustering algorithm, as well as a few other standard scientific Python libraries in various places.

6. How To Generate The Solution

1. Set the environment variable used by the data scripts to the path of the folder containing the CSV files.
2. Run the data conversion script to generate the HDF5 file (it is written alongside the CSV files). This takes around 20 minutes on our machines.
3. Run the script that initializes the validation set HDF5 file.
4. Run the script that generates the validation set from the time-cut list. This can take a few minutes.
5. Run the clustering script to generate the arrival-point clustering (a sketch of this step is given after this list). This can take a few minutes.
6. Create, next to the training script, the two folders it expects: one will receive regular saves of the model parameters, the other the submission files generated from the model at regular intervals.
7. Run the training script with the best-model configuration. Output solutions are written to the submission folder every 1000 iterations, and training can be interrupted at any time with three consecutive Ctrl+C. The script is set to stop after 10,000,000 iterations, but a result file produced after fewer than 2,000,000 iterations is already the winning solution. Training is nonetheless quite long: on a GeForce GTX 680 card it took about an afternoon to produce the winning solution.
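For readers who want to inspect or adapt the arrival-point clustering of step 5, here is a minimal sketch of the kind of computation that step performs. It is an illustration under assumptions, not the code from the archive: the use of scikit-learn's MeanShift, the bandwidth value and all names below are ours, and the actual script reads its points from the HDF5 dataset rather than generating them.

```python
import numpy as np
from sklearn.cluster import MeanShift

def cluster_destinations(dest_points, bandwidth=0.001):
    """Mean-shift clustering of the (lat, lon) destinations of the training rides.

    dest_points : (N, 2) array, one row per training trajectory (its final GPS point).
    The bandwidth is a placeholder: it controls how many clusters come out
    (the model in this report uses 3392 target points).
    """
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    ms.fit(dest_points)
    return ms.cluster_centers_   # (n_clusters, 2) centroids used as the fixed output classes

# Toy usage with synthetic points scattered around Porto.
rng = np.random.default_rng(0)
points = rng.normal(loc=[41.16, -8.62], scale=0.02, size=(2000, 2))
centers = cluster_destinations(points)
print(centers.shape)
```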
When running the training script, set the Theano flags environment variable (THEANO_FLAGS) so that the computation runs on the GPU. Note that Theano only supports CUDA, which requires an Nvidia GPU; training on the CPU is also possible, but much slower.

7. Additional Comments and Observations

The training examples fed to the model are not full trajectories, since that would make no sense, but prefixes of those trajectories, generated on the fly by a Fuel transformer that is part of our data pipeline code. The pipeline works as follows (a small sketch of the prefix generation is given at the end of this report):

- Select a random full trajectory from the dataset.
- Generate at most 100 prefixes of that trajectory. If the trajectory has fewer than 100 data points, generate all possible prefixes; otherwise, choose a random subset of them. Keep the final destination of the trajectory, as it is used as the target during training.
- Keep only the first 5 and last 5 points of each prefix.
- At this point we have a stream of prefixes taken successively from different trajectories. We create batches of size 200 from the items of this stream, taken in the order in which they come. The prefixes generated from a single trajectory may end up in two successive batches, or all in a single batch.

8. References

[1] Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137-1155.

[2] Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) (pp. 315-323).

[3] Bergstra, J., Bastien, F., Breuleux, O., Lamblin, P., Pascanu, R., Delalleau, O., ... & Bengio, Y. (2011). Theano: Deep learning on GPUs with Python. In Big Learning Workshop, NIPS 2011.

[4] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., ... & Bengio, Y. (2012). Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590.
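As an illustration of the prefix-generation pipeline described in section 7, here is the minimal NumPy sketch referenced there. It is not the Fuel transformer from the archive: the function names, the padding of short prefixes and the trajectory representation are assumptions made for this example.

```python
import numpy as np

def prefixes_of(trajectory, rng, max_prefixes=100, k=5):
    """Turn one full trajectory into training examples.

    trajectory : (n, 2) array of GPS points; its last point is the destination,
    which is kept as the training target.  Each example keeps only the first k
    and last k points of the prefix (short prefixes repeat their endpoints,
    a simplification assumed for this sketch).
    """
    destination = trajectory[-1]
    n = len(trajectory)
    lengths = np.arange(1, n + 1)
    if n > max_prefixes:
        # Too many possible prefixes: keep a random subset of them.
        lengths = rng.choice(lengths, size=max_prefixes, replace=False)
    examples = []
    for l in lengths:
        prefix = trajectory[:l]
        first = prefix[np.minimum(np.arange(k), l - 1)]    # first k points, padded
        last = prefix[np.maximum(np.arange(l - k, l), 0)]  # last k points, padded
        examples.append((np.concatenate([first, last]), destination))
    return examples

def batch_stream(trajectories, rng, batch_size=200):
    """Stream prefixes from randomly picked trajectories, grouped into batches in order."""
    buffer = []
    while True:
        trajectory = trajectories[rng.integers(len(trajectories))]
        buffer.extend(prefixes_of(trajectory, rng))
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
```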