From 3fa43e0a437fa776e1dcb4949e7c6d7574239caf Mon Sep 17 00:00:00 2001
From: Étienne Simon
Date: Mon, 13 Jul 2015 16:06:55 +0000
Subject: Fix markdown (first try?)
---
 README.md | 164 ++++++++++++++++----------------------------------------------
 1 file changed, 41 insertions(+), 123 deletions(-)

README.md after this commit:

Winning entry to the Kaggle ECML/PKDD destination competition.

https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i

## Dependencies

We used the following packages developed at the MILA lab:

* Theano. A general GPU-accelerated Python math library, with an interface similar to numpy (see [3, 4]). http://deeplearning.net/software/theano/
* Blocks. A deep-learning and neural network framework for Python, based on Theano. https://github.com/mila-udem/blocks
* Fuel. A data pipelining framework for Blocks. https://github.com/mila-udem/fuel

We also used the scikit-learn Python library for its mean-shift clustering algorithm. numpy, cPickle and h5py are also used in various places.
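The clustering step relies on scikit-learn's `MeanShift` estimator. As a hedged illustration of how the target points can be produced with it (the synthetic data, the bandwidth value, and the printed shape are placeholders, not the settings used in `data_analysis/cluster_arrival.py`):

```python
import numpy as np
from sklearn.cluster import MeanShift

# Stand-in data: in the real pipeline the arrival (lat, lon) pairs are read
# from the HDF5 training file; here we draw a synthetic (N, 2) array instead.
rng = np.random.default_rng(0)
arrivals = rng.uniform(low=(41.0, -8.7), high=(41.3, -8.4), size=(5000, 2))

# bin_seeding speeds mean-shift up considerably on large datasets; the
# bandwidth here is illustrative only -- the actual value is set in
# data_analysis/cluster_arrival.py.
ms = MeanShift(bandwidth=0.01, bin_seeding=True)
ms.fit(arrivals)

centres = ms.cluster_centers_  # on the real data these are the 3392 target points
print(centres.shape)
```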
## Structure

Here is a brief description of the Python files in the archive:

* `config/*.py`: configuration files for the different models we have experimented with. The model which gets the best solution is `dest_mlp_tgtcls_1_cswdtx_alexandre.py`
* `data/*.py`: files related to the data pipeline:
    * `__init__.py` contains some general statistics about the data
    * `csv_to_hdf5.py`: converts the CSV data file into an HDF5 file usable directly by Fuel
    * `hdf5.py`: utility functions for exploiting the HDF5 file
    * `init_valid.py`: initializes the HDF5 file for the validation set
    * `make_valid_cut.py`: generates a validation set using a list of time cuts. Cut lists are stored in Python files in `data/cuts/` (we used a single cut file)
    * `transformers.py`: Fuel pipeline for transforming the training dataset into structures usable by our model
* `data_analysis/*.py`: scripts for various statistical analyses of the dataset
    * `cluster_arrival.py`: the script used to generate the mean-shift clustering of the destination points, producing the 3392 target points
* `model/*.py`: source code for the various models we tried
    * `__init__.py` contains code common to all the models, including the code for embedding the metadata
    * `mlp.py` contains code common to all MLP models
    * `dest_mlp_tgtcls.py` contains the code for our MLP destination prediction model, which uses the target points for its output layer (the idea is sketched at the end of this README)
* `error.py` contains the functions for calculating the error, based on the Haversine distance (a standalone sketch of the formula follows this list)
* `ext_saveload.py` contains a Blocks extension for saving and reloading the model parameters, so that training can be interrupted
* `ext_test.py` contains a Blocks extension that runs the model on the test set and produces an output CSV submission file
* `train.py` contains the main code for training and testing
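The competition is scored with the Haversine (great-circle) distance between the predicted and the true destination. For reference, here is a minimal standalone NumPy version of the formula; it is an illustration only, since the code in `error.py` presumably operates on Theano variables inside the training graph:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Example: distance between two points in Porto
print(haversine_km(41.1579, -8.6291, 41.1496, -8.6110))  # about 1.8 km
```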
## How to reproduce the winning results?

1. Set the `TAXI_PATH` environment variable to the path of the folder containing the CSV files.
2. Run `data/csv_to_hdf5.py` to generate the HDF5 file (which is generated in `TAXI_PATH`, alongside the CSV files). This takes around 20 minutes on our machines.
3. Run `data/init_valid.py` to initialize the validation set HDF5 file.
4. Run `data/make_valid_cut.py test_times_0` to generate the validation set. This can take a few minutes.
5. Run `data_analysis/cluster_arrival.py` to generate the arrival point clustering. This can take a few minutes.
6. Create a folder `model_data` and a folder `output` (next to the training script); they will receive, respectively, periodic saves of the model parameters and the submission files generated from the model at regular intervals.
7. Run `./train.py dest_mlp_tgtcls_1_cswdtx_alexandre` to train the model. Output solutions are generated in `output/` every 1000 iterations. Training can be interrupted at any time with three consecutive Ctrl+C. The training script is set to stop after 10,000,000 iterations, but a result file produced after fewer than 2,000,000 iterations is already the winning solution. We trained our model on a GeForce GTX 680 card, and it took about an afternoon to generate the winning solution.

When running the training script, set the following Theano flags environment variable to exploit GPU parallelism:

`THEANO_FLAGS=floatX=float32,device=gpu,optimizer=FAST_RUN`

Note that Theano is only compatible with CUDA, which requires an Nvidia GPU. Training on the CPU is also possible, but much slower.

*More information in this PDF: https://github.com/adbrebs/taxi/blob/master/doc/short_report.pdf*
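To make the `tgtcls` output layer mentioned above more concrete, here is a rough NumPy sketch of the idea described in the report: the MLP ends in a softmax over the 3392 mean-shift centres, and the predicted destination is the probability-weighted average of those fixed centres. All arrays below are random stand-ins, not the actual model parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 3392 fixed (lat, lon) cluster centres, one hidden-layer
# activation vector, and the final weight matrix of the MLP.
centres = rng.uniform(low=(41.0, -8.7), high=(41.3, -8.4), size=(3392, 2))
hidden = rng.normal(size=500)
weights = rng.normal(size=(500, 3392)) * 0.01
logits = hidden @ weights

# Softmax over the target points...
p = np.exp(logits - logits.max())
p /= p.sum()

# ...and the prediction is the weighted mean of the fixed centres.
destination = p @ centres  # a (lat, lon) pair
print(destination)
```

Predicting a weighted combination of plausible arrival points keeps the output in the space of realistic destinations, while the network only has to learn a distribution over clusters.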