From b68b188e6b8d6a213607a34c38dda316f701ab9a Mon Sep 17 00:00:00 2001 From: AdeB Date: Sun, 12 Jul 2015 12:24:42 -0400 Subject: Get rid of the shadows below the arrows --- doc/kaggle_blog_post.pptx | Bin 57527 -> 45180 bytes doc/memory_taxi.png | Bin 170245 -> 139884 bytes doc/winning_model.png | Bin 118698 -> 102714 bytes 3 files changed, 0 insertions(+), 0 deletions(-) diff --git a/doc/kaggle_blog_post.pptx b/doc/kaggle_blog_post.pptx index 80b0965..a834444 100644 Binary files a/doc/kaggle_blog_post.pptx and b/doc/kaggle_blog_post.pptx differ diff --git a/doc/memory_taxi.png b/doc/memory_taxi.png index 809a570..d4a21bd 100644 Binary files a/doc/memory_taxi.png and b/doc/memory_taxi.png differ diff --git a/doc/winning_model.png b/doc/winning_model.png index 6d19821..0ac8dfc 100644 Binary files a/doc/winning_model.png and b/doc/winning_model.png differ -- cgit v1.2.3 From 53e6666c1a92136534bc80275c1e6eed3185ff16 Mon Sep 17 00:00:00 2001 From: AdeB Date: Sun, 12 Jul 2015 12:25:08 -0400 Subject: Add Etienne's heatmap --- doc/heatmap_3_5.png | Bin 0 -> 2014098 bytes 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 doc/heatmap_3_5.png diff --git a/doc/heatmap_3_5.png b/doc/heatmap_3_5.png new file mode 100644 index 0000000..bc05371 Binary files /dev/null and b/doc/heatmap_3_5.png differ -- cgit v1.2.3 From 6bc13c852a251cc794b2bd60cd77463ad7a8c59d Mon Sep 17 00:00:00 2001 From: AdeB Date: Mon, 13 Jul 2015 11:00:17 -0400 Subject: Alex's instructions to reproduce the results. --- .gitignore | 2 +- README.md | 131 +++++++++++++++++++++++++++++++++++++++++++++++++++ doc/short_report.pdf | Bin 0 -> 108066 bytes 3 files changed, 132 insertions(+), 1 deletion(-) create mode 100644 doc/short_report.pdf diff --git a/.gitignore b/.gitignore index 22f332f..ffcc498 100644 --- a/.gitignore +++ b/.gitignore @@ -2,7 +2,7 @@ # Source archive submission.tgz -*.pdf +#*.pdf # Byte-compiled / optimized / DLL files __pycache__/ diff --git a/README.md b/README.md index 00eb60b..bab0109 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,134 @@ Winning entry to the Kaggle ECML/PKDD destination competition. https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i + + + +**Dependencies** + +We used the following packages developped at the MILA lab: +• Theano. A general GPU-accelerated python math library, with an interface similar to numpy (see [3, 4]). http://deeplearning.net/software/theano/ +• Blocks. A deep-learning and neural network framework for Python based on Theano. https://github.com/mila-udem/blocks +• Fuel. A data pipelining framework for Blocks. https://github.com/mila-udem/fuel +We also used the scikit-learn Python library for their mean-shift clustering algorithm. numpy, cPickle and h5py are also used at various places. + + + +**Structure** + + Here is a brief description of the Python files in the archive: + + <\itemize> + : configuration files for the different + models we have experimented with + + The model which gets the best solution is + + + : files related to the data pipeline: + + <\itemize> + contains some general statistics about the + data + + : convert the CSV data file into an + HDF5 file usable directly by Fuel + + : utility functions for exploiting the HDF5 + file + + : initializes the HDF5 file for the + validation set + + : generate a validation set using a + list of time cuts. 
Cut lists are stored in Python files in + (we used a single cut file) + + : Fuel pipeline for transforming the + training dataset into structures usable by our model + + + > : scripts for various + statistical analyses on the dataset + + <\itemize> + : the script used to generate the + mean-shift clustering of the destination points, producing the 3392 + target points + + + : source code for the various models we tried + + <\itemize> + contains code common to all the models, + including the code for embedding the metadata + + contains code common to all MLP models + + containts code for our MLP + destination prediction model using target points for the output layer + + + contains the functions for calculating the + error based on the Haversine Distance + + contains a Blocks extension for saving + and reloading the model parameters so that training can be interrupted + + contains a Blocks extension that runs the + model on the test set and produces an output CSV submission file + + contains the main code for the training and + testing + + + + **How to reproduce the winning results?** + + + <\enumerate> + Set the environment variable to the path of + the folder containing the CSV files. + + Run to generate the HDF5 file (which + is generated in , along the CSV files). This takes + around 20 minutes on our machines. + + Run to initialize the validation set + HDF5 file. + + Run to generate the + validation set. This can take a few minutes. + + Run to generate the + arrival point clustering. This can take a few minutes. + + Create a folder and a folder + (next to the training script), which will receive + respectively a regular save of the model parameters and many submission + files generated from the model at a regular interval. + + Run to + train the model. Output solutions are generated in + every 1000 iterations. Interrupt the model with three consecutive Ctrl+C + at any times. The training script is set to stop training after 10 000 + 000 iterations, but a result file produced after less than 2 000 000 + iterations is already the winning solution. We trained our model on a + GeForce GTX 680 card and it took about an afternoon to generate the + winning solution. + + When running the training script, set the following Theano flags + environment variable to exploit GPU parallelism: + + + + Theano is only compatible with CUDA, which requires an Nvidia GPU. + Training on the CPU is also possible but much slower. + + + + + + + More information in this pdf: https://github.com/adbrebs/taxi/blob/master/doc/short_report.pdf + diff --git a/doc/short_report.pdf b/doc/short_report.pdf new file mode 100644 index 0000000..8b5296e Binary files /dev/null and b/doc/short_report.pdf differ -- cgit v1.2.3 From 3fa43e0a437fa776e1dcb4949e7c6d7574239caf Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=C3=89tienne=20Simon?= Date: Mon, 13 Jul 2015 16:06:55 +0000 Subject: Fix markdown (first try?) --- README.md | 164 ++++++++++++++++---------------------------------------------- 1 file changed, 41 insertions(+), 123 deletions(-) diff --git a/README.md b/README.md index bab0109..b35f35f 100644 --- a/README.md +++ b/README.md @@ -3,132 +3,50 @@ Winning entry to the Kaggle ECML/PKDD destination competition. https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i - -**Dependencies** +## Dependencies We used the following packages developped at the MILA lab: -• Theano. A general GPU-accelerated python math library, with an interface similar to numpy (see [3, 4]). http://deeplearning.net/software/theano/ -• Blocks. 
A deep-learning and neural network framework for Python based on Theano. https://github.com/mila-udem/blocks -• Fuel. A data pipelining framework for Blocks. https://github.com/mila-udem/fuel -We also used the scikit-learn Python library for their mean-shift clustering algorithm. numpy, cPickle and h5py are also used at various places. - - - -**Structure** - - Here is a brief description of the Python files in the archive: - - <\itemize> - : configuration files for the different - models we have experimented with - - The model which gets the best solution is - - - : files related to the data pipeline: - - <\itemize> - contains some general statistics about the - data - - : convert the CSV data file into an - HDF5 file usable directly by Fuel - - : utility functions for exploiting the HDF5 - file - - : initializes the HDF5 file for the - validation set - - : generate a validation set using a - list of time cuts. Cut lists are stored in Python files in - (we used a single cut file) - - : Fuel pipeline for transforming the - training dataset into structures usable by our model - - - > : scripts for various - statistical analyses on the dataset - - <\itemize> - : the script used to generate the - mean-shift clustering of the destination points, producing the 3392 - target points - - - : source code for the various models we tried - - <\itemize> - contains code common to all the models, - including the code for embedding the metadata - - contains code common to all MLP models - - containts code for our MLP - destination prediction model using target points for the output layer - - contains the functions for calculating the - error based on the Haversine Distance +* Theano. A general GPU-accelerated python math library, with an interface similar to numpy (see [3, 4]). http://deeplearning.net/software/theano/ +* Blocks. A deep-learning and neural network framework for Python based on Theano. https://github.com/mila-udem/blocks +* Fuel. A data pipelining framework for Blocks. https://github.com/mila-udem/fuel - contains a Blocks extension for saving - and reloading the model parameters so that training can be interrupted - - contains a Blocks extension that runs the - model on the test set and produces an output CSV submission file - - contains the main code for the training and - testing - - - - **How to reproduce the winning results?** - - - <\enumerate> - Set the environment variable to the path of - the folder containing the CSV files. - - Run to generate the HDF5 file (which - is generated in , along the CSV files). This takes - around 20 minutes on our machines. - - Run to initialize the validation set - HDF5 file. - - Run to generate the - validation set. This can take a few minutes. - - Run to generate the - arrival point clustering. This can take a few minutes. - - Create a folder and a folder - (next to the training script), which will receive - respectively a regular save of the model parameters and many submission - files generated from the model at a regular interval. - - Run to - train the model. Output solutions are generated in - every 1000 iterations. Interrupt the model with three consecutive Ctrl+C - at any times. The training script is set to stop training after 10 000 - 000 iterations, but a result file produced after less than 2 000 000 - iterations is already the winning solution. We trained our model on a - GeForce GTX 680 card and it took about an afternoon to generate the - winning solution. 
-
-  When running the training script, set the following Theano flags
-  environment variable to exploit GPU parallelism:
-
-
+We also used the scikit-learn Python library for its mean-shift clustering algorithm. numpy, cPickle and h5py are also used in various places.
-  Theano is only compatible with CUDA, which requires an Nvidia GPU.
-  Training on the CPU is also possible but much slower.
-
-
-
-
-
-  More information in this pdf: https://github.com/adbrebs/taxi/blob/master/doc/short_report.pdf
-
+## Structure
+
+Here is a brief description of the Python files in the archive:
+
+* `config/*.py`: configuration files for the different models we have experimented with. The model which gets the best solution is `mlp_tgtcls_1_cswdtx_alexandre.py`
+* `data/*.py` : files related to the data pipeline:
+  * `__init__.py` contains some general statistics about the data
+  * `csv_to_hdf5.py` : converts the CSV data files into an HDF5 file usable directly by Fuel
+  * `hdf5.py` : utility functions for exploiting the HDF5 file
+  * `init_valid.py` : initializes the HDF5 file for the validation set
+  * `make_valid_cut.py` : generates a validation set using a list of time cuts. Cut lists are stored in Python files in `data/cuts/` (we used a single cut file)
+  * `transformers.py` : Fuel pipeline for transforming the training dataset into structures usable by our model
+* `data_analysis/*.py` : scripts for various statistical analyses on the dataset
+  * `cluster_arrival.py` : the script used to generate the mean-shift clustering of the destination points, producing the 3392 target points
+* `model/*.py` : source code for the various models we tried
+  * `__init__.py` contains code common to all the models, including the code for embedding the metadata
+  * `mlp.py` contains code common to all MLP models
+  * `dest_mlp_tgtcls.py` contains code for our MLP destination prediction model using target points for the output layer
+* `error.py` contains the functions for calculating the error based on the Haversine distance
+* `ext_saveload.py` contains a Blocks extension for saving and reloading the model parameters so that training can be interrupted
+* `ext_test.py` contains a Blocks extension that runs the model on the test set and produces an output CSV submission file
+* `train.py` contains the main code for the training and testing
+
+## How to reproduce the winning results?
+
+1. Set the `TAXI_PATH` environment variable to the path of the folder containing the CSV files.
+2. Run `data/csv_to_hdf5.py` to generate the HDF5 file (which is generated in `TAXI_PATH`, alongside the CSV files). This takes around 20 minutes on our machines.
+3. Run `data/init_valid.py` to initialize the validation set HDF5 file.
+4. Run `data/make_valid_cut.py test_times_0` to generate the validation set. This can take a few minutes.
+5. Run `data_analysis/cluster_arrival.py` to generate the arrival point clustering. This can take a few minutes.
+6. Create a folder `model_data` and a folder `output` (next to the training script), which will receive, respectively, regular saves of the model parameters and the submission files generated from the model at regular intervals.
+7. Run `./train.py dest_mlp_tgtcls_1_cswdtx_alexandre` to train the model. Output solutions are generated in `output/` every 1000 iterations. Interrupt training at any time with three consecutive Ctrl+C. The training script is set to stop training after 10 000 000 iterations, but a result file produced after less than 2 000 000 iterations is already the winning solution. We trained our model on a GeForce GTX 680 card and it took about an afternoon to generate the winning solution.
+   When running the training script, set the following Theano flags environment variable to exploit GPU parallelism:
+   `THEANO_FLAGS=floatX=float32,device=gpu,optimizer=FAST_RUN`
+
+*More information in this pdf: https://github.com/adbrebs/taxi/blob/master/doc/short_report.pdf*
-- cgit v1.2.3

From dc430951d6cb660ab804c7e6250aea1acc2dcd9d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20Simon?=
Date: Mon, 13 Jul 2015 16:38:51 +0000
Subject: Fix at_least_k

---
 data/transformers.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/data/transformers.py b/data/transformers.py
index f994488..6d3f488 100644
--- a/data/transformers.py
+++ b/data/transformers.py
@@ -14,7 +14,7 @@ fuel.config.default_seed = 123
 
 def at_least_k(k, v, pad_at_begin, is_longitude):
     if len(v) == 0:
-        v = numpy.array([data.porto_center[1 if is_longitude else 0]], dtype=theano.config.floatX)
+        v = numpy.array([data.train_gps_mean[1 if is_longitude else 0]], dtype=theano.config.floatX)
     if len(v) < k:
         if pad_at_begin:
             v = numpy.concatenate((numpy.full((k - len(v),), v[0]), v))
-- cgit v1.2.3

From c97af300b17ac042c52cfc54f43d4f01fd61fbe9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20Simon?=
Date: Tue, 14 Jul 2015 07:53:03 -0400
Subject: Add prepare.sh to prepare the kaggle data

---
 README.md  |   6 ++--
 prepare.sh | 106 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+), 2 deletions(-)
 create mode 100644 prepare.sh

diff --git a/README.md b/README.md
index b35f35f..0aa5f99 100644
--- a/README.md
+++ b/README.md
@@ -38,10 +38,12 @@ Here is a brief description of the Python files in the archive:
 * `train.py` contains the main code for the training and testing
 
 ## How to reproduce the winning results?
+
+There is a helper script `prepare.sh` which might help you (by performing step 1-6 and some other checks), but if you encounter an error, the script will re-execute all the steps from the beginning (before the actual training, step 2, 4 and 5 are quite long).
 
 1. Set the `TAXI_PATH` environment variable to the path of the folder containing the CSV files.
-2. Run `data/csv_to_hdf5.py` to generate the HDF5 file (which is generated in `TAXI_PATH`, alongside the CSV files). This takes around 20 minutes on our machines.
-3. Run `data/init_valid.py` to initialize the validation set HDF5 file.
+2. Run `data/csv_to_hdf5.py "$TAXI_PATH" "$TAXI_PATH/data.hdf5"` to generate the HDF5 file (which is generated in `TAXI_PATH`, alongside the CSV files). This takes around 20 minutes on our machines.
+3. Run `data/init_valid.py valid.hdf5` to initialize the validation set HDF5 file.
 4. Run `data/make_valid_cut.py test_times_0` to generate the validation set. This can take a few minutes.
 5. Run `data_analysis/cluster_arrival.py` to generate the arrival point clustering. This can take a few minutes.
 6. Create a folder `model_data` and a folder `output` (next to the training script), which will receive, respectively, regular saves of the model parameters and the submission files generated from the model at regular intervals.
diff --git a/prepare.sh b/prepare.sh new file mode 100644 index 0000000..addc3df --- /dev/null +++ b/prepare.sh @@ -0,0 +1,106 @@ +#!/bin/sh + +RESET=`tput sgr0` +BOLD="`tput bold`" +RED="$RESET`tput setaf 1`$BOLD" +GREEN="$RESET`tput setaf 2`" +YELLOW="$RESET`tput setaf 3`" +BLUE="$RESET`tput setaf 4`$BOLD" + +export PYTHONPATH="$PWD:$PYTHONPATH" + +echo "${YELLOW}This script will prepare the data." +echo "${YELLOW}You should run it from inside the repository." +echo "${YELLOW}You should set the TAXI_PATH variable to where the data downloaded from kaggle is." +echo "${YELLOW}Three data files are needed: ${BOLD}train.csv${YELLOW}, ${BOLD}test.csv${YELLOW} and ${BOLD}metaData_taxistandsID_name_GPSlocation.csv.zip${YELLOW}. They can be found at the following url: ${BOLD}https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i/data" +if [ ! -e train.py ]; then + echo "${RED}train.py not found, you are not inside the taxi repository." + exit 1 +fi + + +echo -e "\n$BLUE# Checking dependencies" + +python_import(){ + echo -n "${YELLOW}$1... $RESET" + if ! python -c "import $1; print '${GREEN}version', $1.__version__, '${YELLOW}(we used version $2)'"; then + echo "${RED}failed, $1 is not installed" + exit 1 + fi +} + +python_import h5py 2.5.0 +python_import theano 0.7.0.dev +python_import fuel 0.0.1 +python_import blocks 0.0.1 +python_import sklearn 0.16.1 + + +echo -e "\n$BLUE# Checking data" + +echo "${YELLOW}TAXI_PATH is set to $TAXI_PATH" + +md5_check(){ + echo -n "${YELLOW}md5sum $1... $RESET" + if [ ! -e "$TAXI_PATH/$1" ]; then + echo "${RED}file not found, are you sure you set the TAXI_PATH variable correctly?" + exit 1 + fi + md5=`md5sum "$TAXI_PATH/$1" | sed -e 's/ .*//'` + if [ $md5 = $2 ]; then + echo "$GREEN$md5 ok" + else + echo "$RED$md5 failed" + exit 1 + fi +} + +md5_check train.csv 68cc499ac4937a3079ebf69e69e73971 +md5_check test.csv f2ceffde9d98e3c49046c7d998308e71 +md5_check metaData_taxistandsID_name_GPSlocation.csv.zip fecec7286191af868ce8fb208f5c7643 + + +echo -e "\n$BLUE# Extracting metadata" + +echo -n "${YELLOW}unziping... $RESET" +unzip -o "$TAXI_PATH/metaData_taxistandsID_name_GPSlocation.csv.zip" -d "$TAXI_PATH" +echo "${GREEN}ok" + +echo -n "${YELLOW}patching error in metadata csv... $RESET" +sed -e 's/41,Nevogilde,41.163066654-8.67598304213/41,Nevogilde,41.163066654,-8.67598304213/' -i "$TAXI_PATH/metaData_taxistandsID_name_GPSlocation.csv" +echo "${GREEN}ok" + +md5_check metaData_taxistandsID_name_GPSlocation.csv 724805b0b1385eb3efc02e8bdfe9c1df + + +echo -e "\n$BLUE# Conversion of training set to HDF5" +echo "${YELLOW}This might take some time$RESET" +data/csv_to_hdf5.py "$TAXI_PATH" "$TAXI_PATH/data.hdf5" + + +echo -e "\n$BLUE# Generation of validation set" +echo "${YELLOW}This might take some time$RESET" + +echo -n "${YELLOW}initialization... $RESET" +data/init_valid.py +echo "${GREEN}ok" + +echo -n "${YELLOW}cutting... $RESET" +data/make_valid_cut.py test_times_0 +echo "${GREEN}ok" + + +echo -e "\n$BLUE# Generation of destination cluster" +echo "${YELLOW}This might take some time$RESET" +echo -n "${YELLOW}generating... $RESET" +data_analysis/cluster_arrival.py +echo "${GREEN}ok" + + +echo -e "\n$BLUE# Creating output folders" +echo -n "${YELLOW}mkdir model_data... $RESET"; mkdir model_data; echo "${GREEN}ok" +echo -n "${YELLOW}mkdir output... 
$RESET"; mkdir output; echo "${GREEN}ok"
+
+echo -e "\n$GREEN${BOLD}The data was successfully prepared"
+echo "${YELLOW}To train the winning model on gpu, you can now run the following command:"
+echo "${YELLOW}THEANO_FLAGS=floatX=float32,device=gpu,optimizer=FAST_RUN ./train.py dest_mlp_tgtcls_1_cswdtx_alexandre"
-- cgit v1.2.3

From 87fc87384e6d9b7d88ca622a17dac7b8bc15cacb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20Simon?=
Date: Tue, 14 Jul 2015 07:58:20 -0400
Subject: Add note about PYTHONPATH in README

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 0aa5f99..e9f1ce1 100644
--- a/README.md
+++ b/README.md
@@ -40,6 +40,8 @@ Here is a brief description of the Python files in the archive:
 
 ## How to reproduce the winning results?
 
 There is a helper script `prepare.sh` which might help you (by performing step 1-6 and some other checks), but if you encounter an error, the script will re-execute all the steps from the beginning (before the actual training, step 2, 4 and 5 are quite long).
+
+Note that some scripts expect the repository to be in your PYTHONPATH (go to the root of the repository and type `export PYTHONPATH="$PWD:$PYTHONPATH"`).
 
 1. Set the `TAXI_PATH` environment variable to the path of the folder containing the CSV files.
-- cgit v1.2.3

From a2c922f0397c0438c9163ebeaded159315a01877 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20Simon?=
Date: Tue, 14 Jul 2015 07:59:39 -0400
Subject: fix typo in README

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index e9f1ce1..05cd7ed 100644
--- a/README.md
+++ b/README.md
@@ -39,7 +39,7 @@ Here is a brief description of the Python files in the archive:
 
 ## How to reproduce the winning results?
 
-There is a helper script `prepare.sh` which might help you (by performing step 1-6 and some other checks), but if you encounter an error, the script will re-execute all the steps from the beginning (before the actual training, step 2, 4 and 5 are quite long).
+There is a helper script `prepare.sh` which might help you (by performing steps 1-6 and some other checks), but if you encounter an error, the script will re-execute all the steps from the beginning (before the actual training, steps 2, 4 and 5 are quite long).
 
 Note that some scripts expect the repository to be in your PYTHONPATH (go to the root of the repository and type `export PYTHONPATH="$PWD:$PYTHONPATH"`).
-- cgit v1.2.3
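
A note on the clustering step (step 5 of the reproduction instructions): `data_analysis/cluster_arrival.py` uses scikit-learn's mean-shift implementation to turn the training-set destination points into the 3392 target points consumed by the output layer. The snippet below is only a minimal sketch of that idea, not the actual script; the synthetic input points, the bandwidth and the `min_bin_freq` value are illustrative assumptions.

```python
# Minimal sketch of the arrival-point clustering idea behind
# data_analysis/cluster_arrival.py (not the actual script). The synthetic
# input and the bandwidth/min_bin_freq values are illustrative assumptions.
import numpy
from sklearn.cluster import MeanShift

# Stand-in for the real training arrivals: random (latitude, longitude)
# pairs scattered around Porto. The real script reads them from the HDF5 data.
rng = numpy.random.RandomState(123)
dests = rng.normal(loc=[41.16, -8.62], scale=0.01, size=(10000, 2))

# bin_seeding keeps mean-shift tractable on a large number of points by
# seeding the search from a coarse grid instead of from every sample.
ms = MeanShift(bandwidth=0.001, bin_seeding=True, min_bin_freq=5)
ms.fit(dests)

centres = ms.cluster_centers_  # one (lat, lon) row per target point
print("number of target points:", len(centres))
```

On the real dataset the number of resulting centres depends mainly on the bandwidth; the values above are placeholders, not the ones used to obtain the 3392 points.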
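The `Fix at_least_k` patch above only changes the fallback value used when a trajectory is empty (the training-set GPS mean instead of a fixed Porto centre point). For readers who want that helper in isolation, here is a self-contained sketch: the fallback coordinates stand in for the project's `data.train_gps_mean`, the float32 dtype stands in for `theano.config.floatX`, and the pad-at-end branch is reconstructed since the diff does not show it.

```python
# Self-contained sketch of the at_least_k padding logic from data/transformers.py.
import numpy

TRAIN_GPS_MEAN = (41.1573, -8.6159)  # illustrative (latitude, longitude) stand-in

def at_least_k(k, v, pad_at_begin, is_longitude):
    """Return v padded so that it holds at least k coordinate values."""
    v = numpy.asarray(v, dtype="float32")
    if len(v) == 0:
        # Empty trajectories fall back to the training-set mean position
        # (the fix above: previously a fixed Porto city-centre constant).
        v = numpy.array([TRAIN_GPS_MEAN[1 if is_longitude else 0]], dtype="float32")
    if len(v) < k:
        if pad_at_begin:
            v = numpy.concatenate((numpy.full((k - len(v),), v[0]), v))
        else:
            # Reconstructed branch (not shown in the diff): pad at the end
            # by repeating the last known coordinate.
            v = numpy.concatenate((v, numpy.full((k - len(v),), v[-1])))
    return v

# Example: pad a two-point latitude track to length 5.
print(at_least_k(5, [41.16, 41.17], pad_at_begin=False, is_longitude=False))
```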