From 6bc13c852a251cc794b2bd60cd77463ad7a8c59d Mon Sep 17 00:00:00 2001
From: AdeB <adbrebs@gmail.com>
Date: Mon, 13 Jul 2015 11:00:17 -0400
Subject: Alex's instructions to reproduce the results.

---
 README.md | 131 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 131 insertions(+)

(limited to 'README.md')
diff --git a/README.md b/README.md
index 00eb60b..bab0109 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,134 @@
 Winning entry to the Kaggle ECML/PKDD destination competition.
 
 https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i
+
+
+
+**Dependencies**
+
+We used the following packages developed at the MILA lab:
+•  Theano. A general GPU-accelerated python math library, with an interface similar to numpy (see [3, 4]). http://deeplearning.net/software/theano/
+•  Blocks. A deep-learning and neural network framework for Python based on Theano. https://github.com/mila-udem/blocks
+•  Fuel. A data pipelining framework for Blocks. https://github.com/mila-udem/fuel 
+We also used the scikit-learn Python library for their mean-shift clustering algorithm. numpy, cPickle and h5py are also used at various places.
+
+
+
+**Structure**
+
+  Here is a brief description of the Python files in the archive:
+
+  <\itemize>
+    <item><verbatim|config/*.py> : configuration files for the different
+    models we have experimented with
+
+    The model which gets the best solution is
+    <verbatim|mlp_tgtcls_1_cswdtx_alexandre.py>
+
+    <item><verbatim|data/*.py> : files related to the data pipeline:
+
+    <\itemize>
+      <item><verbatim|__init__.py> contains some general statistics about the
+      data
+
+      <item><verbatim|csv_to_hdf5.py> : convert the CSV data file into an
+      HDF5 file usable directly by Fuel
+
+      <item><verbatim|hdf5.py> : utility functions for exploiting the HDF5
+      file
+
+      <item><verbatim|init_valid.py> : initializes the HDF5 file for the
+      validation set
+
+      <item><verbatim|make_valid_cut.py> : generate a validation set using a
+      list of time cuts. Cut lists are stored in Python files in
+      <verbatim|data/cuts/> (we used a single cut file)
+
+      <item><verbatim|transformers.py> : Fuel pipeline for transforming the
+      training dataset into structures usable by our model
+    </itemize>
+
+    <item><strong|<verbatim|data_analysis/*.py>> : scripts for various
+    statistical analyses on the dataset
+
+    <\itemize>
+      <item><verbatim|cluster_arrival.py> : the script used to generate the
+      mean-shift clustering of the destination points, producing the 3392
+      target points
+    </itemize>
+
+    <item><verbatim|model/*.py> : source code for the various models we tried
+
+    <\itemize>
+      <item><verbatim|__init__.py> contains code common to all the models,
+      including the code for embedding the metadata
+
+      <item><verbatim|mlp.py> contains code common to all MLP models
+
+      <item><verbatim|dest_mlp_tgtcls.py> containts code for our MLP
+      destination prediction model using target points for the output layer
+    </itemize>
+
+    <item><verbatim|error.py> contains the functions for calculating the
+    error based on the Haversine Distance
+
+    <item><verbatim|ext_saveload.py> contains a Blocks extension for saving
+    and reloading the model parameters so that training can be interrupted
+
+    <item><verbatim|ext_test.py> contains a Blocks extension that runs the
+    model on the test set and produces an output CSV submission file
+
+    <item><verbatim|train.py> contains the main code for the training and
+    testing
+  </itemize>
+  
+  
+  **How to reproduce the winning results?**
+  
+  
+    <\enumerate>
+    <item>Set the <verbatim|TAXI_PATH> environment variable to the path of
+    the folder containing the CSV files.
+
+    <item>Run <verbatim|data/csv_to_hdf5.py> to generate the HDF5 file (which
+    is generated in <verbatim|TAXI_PATH>, along the CSV files). This takes
+    around 20 minutes on our machines.
+
+    <item>Run <verbatim|data/init_valid.py> to initialize the validation set
+    HDF5 file.
+
+    <item>Run <verbatim|data/make_valid_cut.py test_times_0> to generate the
+    validation set. This can take a few minutes.
+
+    <item>Run <verbatim|data_analysis/cluster_arrival.py> to generate the
+    arrival point clustering. This can take a few minutes.
+
+    <item>Create a folder <verbatim|model_data> and a folder
+    <verbatim|output> (next to the training script), which will receive
+    respectively a regular save of the model parameters and many submission
+    files generated from the model at a regular interval.
+
+    <item>Run <verbatim|./train.py dest_mlp_tgtcls_1_cswdtx_alexandre> to
+    train the model. Output solutions are generated in <verbatim|output/>
+    every 1000 iterations. Interrupt the model with three consecutive Ctrl+C
+    at any times. The training script is set to stop training after 10 000
+    000 iterations, but a result file produced after less than 2 000 000
+    iterations is already the winning solution. We trained our model on a
+    GeForce GTX 680 card and it took about an afternoon to generate the
+    winning solution.
+
+    When running the training script, set the following Theano flags
+    environment variable to exploit GPU parallelism:
+
+    <verbatim|THEANO_FLAGS=floatX=float32,device=gpu,optimizer=FAST_RUN>
+
+    Theano is only compatible with CUDA, which requires an Nvidia GPU.
+    Training on the CPU is also possible but much slower.
+  </enumerate>
+
+  
+  
+  
+  
+  More information in this pdf: https://github.com/adbrebs/taxi/blob/master/doc/short_report.pdf
+  
-- 
cgit v1.2.3


From 3fa43e0a437fa776e1dcb4949e7c6d7574239caf Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20Simon?= <esimon@esimon.eu>
Date: Mon, 13 Jul 2015 16:06:55 +0000
Subject: Fix markdown (first try?)

---
 README.md | 164 ++++++++++++++++----------------------------------------------
 1 file changed, 41 insertions(+), 123 deletions(-)

(limited to 'README.md')

diff --git a/README.md b/README.md
index bab0109..b35f35f 100644
--- a/README.md
+++ b/README.md
@@ -3,132 +3,50 @@ Winning entry to the Kaggle ECML/PKDD destination competition.
 https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i
 
 
-
-**Dependencies**
+## Dependencies
 
 We used the following packages developed at the MILA lab:
-•  Theano. A general GPU-accelerated python math library, with an interface similar to numpy (see [3, 4]). http://deeplearning.net/software/theano/
-•  Blocks. A deep-learning and neural network framework for Python based on Theano. https://github.com/mila-udem/blocks
-•  Fuel. A data pipelining framework for Blocks. https://github.com/mila-udem/fuel 
-We also used the scikit-learn Python library for their mean-shift clustering algorithm. numpy, cPickle and h5py are also used at various places.
-
-
-
-**Structure**
-
-  Here is a brief description of the Python files in the archive:
-
-  <\itemize>
-    <item><verbatim|config/*.py> : configuration files for the different
-    models we have experimented with
-
-    The model which gets the best solution is
-    <verbatim|mlp_tgtcls_1_cswdtx_alexandre.py>
-
-    <item><verbatim|data/*.py> : files related to the data pipeline:
-
-    <\itemize>
-      <item><verbatim|__init__.py> contains some general statistics about the
-      data
-
-      <item><verbatim|csv_to_hdf5.py> : convert the CSV data file into an
-      HDF5 file usable directly by Fuel
-
-      <item><verbatim|hdf5.py> : utility functions for exploiting the HDF5
-      file
-
-      <item><verbatim|init_valid.py> : initializes the HDF5 file for the
-      validation set
-
-      <item><verbatim|make_valid_cut.py> : generate a validation set using a
-      list of time cuts. Cut lists are stored in Python files in
-      <verbatim|data/cuts/> (we used a single cut file)
-
-      <item><verbatim|transformers.py> : Fuel pipeline for transforming the
-      training dataset into structures usable by our model
-    </itemize>
-
-    <item><strong|<verbatim|data_analysis/*.py>> : scripts for various
-    statistical analyses on the dataset
-
-    <\itemize>
-      <item><verbatim|cluster_arrival.py> : the script used to generate the
-      mean-shift clustering of the destination points, producing the 3392
-      target points
-    </itemize>
-
-    <item><verbatim|model/*.py> : source code for the various models we tried
-
-    <\itemize>
-      <item><verbatim|__init__.py> contains code common to all the models,
-      including the code for embedding the metadata
-
-      <item><verbatim|mlp.py> contains code common to all MLP models
-
-      <item><verbatim|dest_mlp_tgtcls.py> containts code for our MLP
-      destination prediction model using target points for the output layer
-    </itemize>
 
-    <item><verbatim|error.py> contains the functions for calculating the
-    error based on the Haversine Distance
+* Theano. A general GPU-accelerated python math library, with an interface similar to numpy (see [3, 4]). http://deeplearning.net/software/theano/
+* Blocks. A deep-learning and neural network framework for Python based on Theano. https://github.com/mila-udem/blocks
+* Fuel. A data pipelining framework for Blocks. https://github.com/mila-udem/fuel 
 
-    <item><verbatim|ext_saveload.py> contains a Blocks extension for saving
-    and reloading the model parameters so that training can be interrupted
-
-    <item><verbatim|ext_test.py> contains a Blocks extension that runs the
-    model on the test set and produces an output CSV submission file
-
-    <item><verbatim|train.py> contains the main code for the training and
-    testing
-  </itemize>
-  
-  
-  **How to reproduce the winning results?**
-  
-  
-    <\enumerate>
-    <item>Set the <verbatim|TAXI_PATH> environment variable to the path of
-    the folder containing the CSV files.
-
-    <item>Run <verbatim|data/csv_to_hdf5.py> to generate the HDF5 file (which
-    is generated in <verbatim|TAXI_PATH>, along the CSV files). This takes
-    around 20 minutes on our machines.
-
-    <item>Run <verbatim|data/init_valid.py> to initialize the validation set
-    HDF5 file.
-
-    <item>Run <verbatim|data/make_valid_cut.py test_times_0> to generate the
-    validation set. This can take a few minutes.
-
-    <item>Run <verbatim|data_analysis/cluster_arrival.py> to generate the
-    arrival point clustering. This can take a few minutes.
-
-    <item>Create a folder <verbatim|model_data> and a folder
-    <verbatim|output> (next to the training script), which will receive
-    respectively a regular save of the model parameters and many submission
-    files generated from the model at a regular interval.
-
-    <item>Run <verbatim|./train.py dest_mlp_tgtcls_1_cswdtx_alexandre> to
-    train the model. Output solutions are generated in <verbatim|output/>
-    every 1000 iterations. Interrupt the model with three consecutive Ctrl+C
-    at any times. The training script is set to stop training after 10 000
-    000 iterations, but a result file produced after less than 2 000 000
-    iterations is already the winning solution. We trained our model on a
-    GeForce GTX 680 card and it took about an afternoon to generate the
-    winning solution.
-
-    When running the training script, set the following Theano flags
-    environment variable to exploit GPU parallelism:
-
-    <verbatim|THEANO_FLAGS=floatX=float32,device=gpu,optimizer=FAST_RUN>
+We also used the scikit-learn Python library for its mean-shift clustering algorithm. numpy, cPickle and h5py are also used in various places.
 
-    Theano is only compatible with CUDA, which requires an Nvidia GPU.
-    Training on the CPU is also possible but much slower.
-  </enumerate>
 
-  
-  
-  
-  
-  More information in this pdf: https://github.com/adbrebs/taxi/blob/master/doc/short_report.pdf
-  
+## Structure
+
+Here is a brief description of the Python files in the archive:
+
+* `config/*.py`: configuration files for the different models we have experimented with. The model which gets the best solution is `mlp_tgtcls_1_cswdtx_alexandre.py`
+* `data/*.py` : files related to the data pipeline:
+  * `__init__.py` contains some general statistics about the data
+  * `csv_to_hdf5.py` : convert the CSV data file into an HDF5 file usable directly by Fuel
+  * `hdf5.py` : utility functions for exploiting the HDF5 file
+  * `init_valid.py` : initializes the HDF5 file for the validation set
+  * `make_valid_cut.py` : generate a validation set using a list of time cuts. Cut lists are stored in Python files in `data/cuts/` (we used a single cut file)
+  * `transformers.py` : Fuel pipeline for transforming the training dataset into structures usable by our model
+* `data_analysis/*.py` : scripts for various statistical analyses on the dataset
+  * `cluster_arrival.py` : the script used to generate the mean-shift clustering of the destination points, producing the 3392 target points
+* `model/*.py` : source code for the various models we tried
+  * `__init__.py` contains code common to all the models, including the code for embedding the metadata
+  * `mlp.py` contains code common to all MLP models
+  * `dest_mlp_tgtcls.py` contains code for our MLP destination prediction model using target points for the output layer
+* `error.py` contains the functions for calculating the error based on the Haversine Distance
+* `ext_saveload.py` contains a Blocks extension for saving and reloading the model parameters so that training can be interrupted
+* `ext_test.py` contains a Blocks extension that runs the model on the test set and produces an output CSV submission file
+* `train.py` contains the main code for the training and testing
+  
+## How to reproduce the winning results?
+  
+1. Set the `TAXI_PATH` environment variable to the path of the folder containing the CSV files.
+2. Run `data/csv_to_hdf5.py` to generate the HDF5 file (which is generated in `TAXI_PATH`, along the CSV files). This takes around 20 minutes on our machines.
+3. Run `data/init_valid.py` to initialize the validation set HDF5 file.
+4. Run `data/make_valid_cut.py test_times_0` to generate the validation set. This can take a few minutes.
+5. Run `data_analysis/cluster_arrival.py` to generate the arrival point clustering. This can take a few minutes.
+6. Create a folder `model_data` and a folder `output` (next to the training script), which will receive respectively a regular save of the model parameters and many submission files generated from the model at a regular interval.
+7. Run `./train.py dest_mlp_tgtcls_1_cswdtx_alexandre` to train the model. Output solutions are generated in `output/` every 1000 iterations. Training can be interrupted at any time with three consecutive Ctrl+C. The training script is set to stop after 10 000 000 iterations, but a result file produced after fewer than 2 000 000 iterations is already the winning solution. We trained our model on a GeForce GTX 680 card and it took about an afternoon to generate the winning solution.
+   When running the training script, set the following Theano flags environment variable to exploit GPU parallelism:
+   `THEANO_FLAGS=floatX=float32,device=gpu,optimizer=FAST_RUN`
+
+*More information in this pdf: https://github.com/adbrebs/taxi/blob/master/doc/short_report.pdf*
-- 
cgit v1.2.3
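
The Structure list in the patch above says that `error.py` computes the error from the Haversine distance. As a rough illustration of that metric only (this is not the repository's code), here is a minimal sketch. The function name `haversine_km` and the 6371 km Earth radius are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; not taken from error.py


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp against rounding error before taking the arcsine.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

With this sketch, one degree of longitude along the equator comes out at roughly 111 km, which is a quick sanity check on the formula.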


From c97af300b17ac042c52cfc54f43d4f01fd61fbe9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20Simon?= <esimon@esimon.eu>
Date: Tue, 14 Jul 2015 07:53:03 -0400
Subject: Add prepare.sh to prepare the kaggle data

---
 README.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

(limited to 'README.md')

diff --git a/README.md b/README.md
index b35f35f..0aa5f99 100644
--- a/README.md
+++ b/README.md
@@ -38,10 +38,12 @@ Here is a brief description of the Python files in the archive:
 * `train.py` contains the main code for the training and testing
   
 ## How to reproduce the winning results?
+
+There is an helper script `prepare.sh` which might helps you (by performing step 1-6 and some other checks), but if you encounter an error, the script will re-execute all the steps from the beginning (before the actual training, step 2, 4 and 5 are quite long).
   
 1. Set the `TAXI_PATH` environment variable to the path of the folder containing the CSV files.
-2. Run `data/csv_to_hdf5.py` to generate the HDF5 file (which is generated in `TAXI_PATH`, along the CSV files). This takes around 20 minutes on our machines.
-3. Run `data/init_valid.py` to initialize the validation set HDF5 file.
+2. Run `data/csv_to_hdf5.py "$TAXI_PATH" "$TAXI_PATH/data.hdf5"` to generate the HDF5 file (which is generated in `TAXI_PATH`, along the CSV files). This takes around 20 minutes on our machines.
+3. Run `data/init_valid.py valid.hdf5` to initialize the validation set HDF5 file.
 4. Run `data/make_valid_cut.py test_times_0` to generate the validation set. This can take a few minutes.
 5. Run `data_analysis/cluster_arrival.py` to generate the arrival point clustering. This can take a few minutes.
 6. Create a folder `model_data` and a folder `output` (next to the training script), which will receive respectively a regular save of the model parameters and many submission files generated from the model at a regular interval.
-- 
cgit v1.2.3


From 87fc87384e6d9b7d88ca622a17dac7b8bc15cacb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20Simon?= <esimon@esimon.eu>
Date: Tue, 14 Jul 2015 07:58:20 -0400
Subject: Add note about PYTHONPATH in README

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'README.md')

diff --git a/README.md b/README.md
index 0aa5f99..e9f1ce1 100644
--- a/README.md
+++ b/README.md
@@ -40,6 +40,8 @@ Here is a brief description of the Python files in the archive:
 ## How to reproduce the winning results?
 
 There is an helper script `prepare.sh` which might helps you (by performing step 1-6 and some other checks), but if you encounter an error, the script will re-execute all the steps from the beginning (before the actual training, step 2, 4 and 5 are quite long).
+
+Note that some scripts expect the repository to be in your PYTHONPATH (go to the root of the repository and type `export PYTHONPATH="$PWD:$PYTHONPATH"`).
   
 1. Set the `TAXI_PATH` environment variable to the path of the folder containing the CSV files.
 2. Run `data/csv_to_hdf5.py "$TAXI_PATH" "$TAXI_PATH/data.hdf5"` to generate the HDF5 file (which is generated in `TAXI_PATH`, along the CSV files). This takes around 20 minutes on our machines.
-- 
cgit v1.2.3


From a2c922f0397c0438c9163ebeaded159315a01877 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89tienne=20Simon?= <esimon@esimon.eu>
Date: Tue, 14 Jul 2015 07:59:39 -0400
Subject: fix typo in README

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'README.md')

diff --git a/README.md b/README.md
index e9f1ce1..05cd7ed 100644
--- a/README.md
+++ b/README.md
@@ -39,7 +39,7 @@ Here is a brief description of the Python files in the archive:
   
 ## How to reproduce the winning results?
 
-There is an helper script `prepare.sh` which might helps you (by performing step 1-6 and some other checks), but if you encounter an error, the script will re-execute all the steps from the beginning (before the actual training, step 2, 4 and 5 are quite long).
+There is a helper script `prepare.sh` which might help you (by performing steps 1-6 and some other checks), but if you encounter an error, the script will re-execute all the steps from the beginning (before the actual training, steps 2, 4 and 5 are quite long).
 
 Note that some scripts expect the repository to be in your PYTHONPATH (go to the root of the repository and type `export PYTHONPATH="$PWD:$PYTHONPATH"`).
   
-- 
cgit v1.2.3
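
Taken together, the README steps 1-7 and the PYTHONPATH note from these patches can be sketched as a small driver. This is an illustration only and not the `prepare.sh` script mentioned above: the names `reproduction_steps` and `run_all`, the default `dry_run=True` behaviour and the use of `os.makedirs` for step 6 are assumptions made here, and the scripts are assumed to be executable on a POSIX system. The script paths, arguments and `THEANO_FLAGS` value are copied from the README text.

```python
import os
import subprocess

# Value quoted in the README for GPU training.
THEANO_FLAGS = "floatX=float32,device=gpu,optimizer=FAST_RUN"


def reproduction_steps(taxi_path, config="dest_mlp_tgtcls_1_cswdtx_alexandre"):
    """Return README steps 2-5 and 7 as (argv, extra_env) pairs, in order."""
    data_file = os.path.join(taxi_path, "data.hdf5")
    return [
        (["data/csv_to_hdf5.py", taxi_path, data_file], {}),         # step 2
        (["data/init_valid.py", "valid.hdf5"], {}),                  # step 3
        (["data/make_valid_cut.py", "test_times_0"], {}),            # step 4
        (["data_analysis/cluster_arrival.py"], {}),                  # step 5
        (["./train.py", config], {"THEANO_FLAGS": THEANO_FLAGS}),    # step 7
    ]


def run_all(repo_root, taxi_path, dry_run=True):
    """Print the commands, or run them from repo_root when dry_run is False."""
    env = dict(os.environ, TAXI_PATH=taxi_path)  # step 1
    # Some scripts expect the repository root to be on PYTHONPATH.
    env["PYTHONPATH"] = os.pathsep.join(
        p for p in (repo_root, env.get("PYTHONPATH")) if p
    )
    if not dry_run:
        # Step 6, done up front: creating the empty folders early is harmless.
        for folder in ("model_data", "output"):
            os.makedirs(os.path.join(repo_root, folder), exist_ok=True)
    for argv, extra_env in reproduction_steps(taxi_path):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.check_call(argv, cwd=repo_root, env=dict(env, **extra_env))
```

Calling `run_all("/path/to/repo", "/path/to/csv")` only prints the commands; pass `dry_run=False` to execute them. Unlike a plain shell script, `check_call` stops at the first failing step, so a failed step can be fixed and the remaining ones rerun by hand.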