- Hosting Piwik
- Writing a script to import the log into Piwik
- Writing a script to download the logs automatically
- Setting up a Cron job to run the combination of both scripts every night
The most famous tool in this area is Google Analytics. While powerful, it sends data from the visitors of my site to a 3rd party (Google) whose business is selling ads by tracking people around the web. That makes me a bit wary regarding users’ privacy.
To keep control of that data, a strong alternative is Piwik. It’s open-source and free to host on your own servers. This puts you in control of the collected data and of how your users’ privacy is respected (Do Not Track browser setting, IP anonymisation…).
The first step is obviously to get Piwik running, ideally hosted on a separate vhost and with a separate database from your main website. After creating a new site and database on OVH’s hosting admin panel, Piwik’s installation is pretty straightforward following their user guide.
With Piwik ready to receive data, let’s work backward and look first at how to feed it the logs.
Along with the PHP file providing the UI and the endpoint for the JS tracker, Piwik also provides a Python script for importing logs into it. All our import script needs to do is configure the command with the appropriate options.
The script requires some kind of authentication to import the data into Piwik: either the login/password combination you use to log in to the UI, or an auth token that you can generate from the Piwik UI. Either way, these are kept out of the script to make it portable, and the appropriate flags (--token-auth, for example) are stored in a separate file.
The tool also needed a bit of help to parse the logs. The log format for OVH’s shared hosting doesn’t exactly match any of the well-known log formats supported by the import tool. This prevented the tool from extracting the host information from the logs, so a specific regex had to be provided.
With the host information at hand, the tool could now generate Piwik sites for each host encountered in the log thanks to the --add-sites-new-hosts flag. If you’re only interested in specific hosts, you can also filter the logs by host.
Last, Piwik’s import tool ignores static file downloads, HTTP redirects, and HTTP errors by default. It provides --enable-<XYZ> flags to include them in the import, though. Redirects added too much noise to the stats, but HTTP errors are definitely something I’m interested in. And I’m thinking downloads could give a view of the RSS feed audience (tracking a query parameter on images loaded from RSS, for example), so I kept them too.
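To make the options above concrete, here is a minimal sketch of what such an import wrapper could look like. The Piwik URL, the auth file name and the DRY_RUN switch are assumptions made for illustration, and --enable-static is my guess at the flag covering static file downloads. Only --token-auth, --add-sites-new-hosts and the --enable-<XYZ> family come from the text above, so check the import tool's help for the exact options in your version.

```shell
#!/bin/sh
# Sketch of an import wrapper. URL, paths and file names are hypothetical.
set -eu

PIWIK_URL="${PIWIK_URL:-https://piwik.example.com}"                    # assumed vhost
IMPORT_TOOL="${IMPORT_TOOL:-piwik/misc/log-analytics/import_logs.py}"  # adjust to your install
AUTH_FILE="${AUTH_FILE:-.piwik.auth}"                                  # hypothetical file name

# Import one log file into Piwik.
# With DRY_RUN=1 (the default here) the command is printed, one argument
# per line, instead of being executed.
run_import() {
  log_file="$1"
  set -- "--url=$PIWIK_URL"
  # Credential flags live in AUTH_FILE, one per line
  # (for example --token-auth=<your token>), so they stay out of this script.
  if [ -r "$AUTH_FILE" ]; then
    while IFS= read -r flag || [ -n "$flag" ]; do
      [ -z "$flag" ] || set -- "$@" "$flag"
    done < "$AUTH_FILE"
  fi
  # Create sites for new hosts, and keep HTTP errors and static files.
  set -- "$@" --add-sites-new-hosts --enable-http-errors --enable-static "$log_file"
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' python "$IMPORT_TOOL" "$@"
  else
    python "$IMPORT_TOOL" "$@"
  fi
}

run_import "${1:-access.log}"
```

Running it with DRY_RUN=0 would launch the actual import. Keeping the dry run as the default makes it easy to check the assembled command before any data reaches Piwik.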
But to import logs into Piwik… well… we need logs. Let’s see how to collect them.
For shared hosting, OVH provides access to gzipped versions of the Apache logs. Of course, they are not publicly accessible. In the “Statistics and logs” section of the hosting admin panel, you can create a login/password to get access to them.
From there, the super structured URLs make it easy to collect the logs for a given date.
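As an illustration of how a dated URL can be assembled, here is a small sketch. The base URL, the site name and the path layout are all placeholders I made up for the example; the real pattern is the one shown in your “Statistics and logs” section.

```shell
#!/bin/sh
# Sketch: build the URL of the gzipped log for a given day.
# LOGS_BASE_URL, SITE and the path layout are hypothetical placeholders.
set -eu

LOGS_BASE_URL="${LOGS_BASE_URL:-https://logs.example.com}"
SITE="${SITE:-example.com}"

# Arguments: year (YYYY), month (MM), day (DD).
log_url() {
  printf '%s/%s/logs-%s-%s/%s-%s-%s-%s.log.gz\n' \
    "$LOGS_BASE_URL" "$SITE" "$2" "$1" "$SITE" "$3" "$2" "$1"
}

log_url 2016 03 07
# prints https://logs.example.com/example.com/logs-03-2016/example.com-07-03-2016.log.gz
```

On a system with GNU date, yesterday's URL could be obtained with `log_url $(date -d yesterday '+%Y %m %d')`. Other date implementations use different options for relative dates.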
As with the Piwik import script, the credentials are kept out of the script. You’ll need to put the necessary curl flags in a separate file.
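A download step along those lines could be sketched as follows. The auth file name is hypothetical. It is assumed to be a curl config file, which curl reads through its --config option, holding a line such as `user = "login:password"`.

```shell
#!/bin/sh
# Sketch of the download step. The auth file name is a hypothetical example.
set -eu

CURL_AUTH_FILE="${CURL_AUTH_FILE:-.ovh.auth}"

# Download one gzipped log and leave it decompressed at the destination.
# Arguments: URL of the .log.gz file, destination path for the plain log.
download_log() {
  url="$1"
  dest="$2"
  # --config reads extra curl options (here the credentials) from a file.
  # --fail makes curl exit non-zero on HTTP errors instead of saving an error page.
  curl --silent --show-error --fail \
    --config "$CURL_AUTH_FILE" \
    --output "$dest.gz" "$url"
  # Replaces "$dest.gz" with the decompressed "$dest".
  gunzip -f "$dest.gz"
}

# Example call (not executed here):
#   download_log "https://logs.example.com/some/path.log.gz" /tmp/yesterday.log
```

Saving the archive to disk before decompressing, rather than piping curl into gunzip, means a failed download stops the script instead of silently producing an empty log.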
The different parts are ready; time to combine them.
Cron is a tool that schedules scripts to run at regular intervals. That’s what we’ll use to run the import every night. To make things simpler, one last script is needed to combine the two previous ones and make the magic happen.
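That last script can stay very small. Here is a sketch under the assumption that the download and import steps live in two executable scripts, each taking the log file path as its only argument; the script names are hypothetical.

```shell
#!/bin/sh
# Sketch of the nightly wrapper. Script names are hypothetical.
set -eu

DOWNLOAD_SCRIPT="${DOWNLOAD_SCRIPT:-./download-logs.sh}"
IMPORT_SCRIPT="${IMPORT_SCRIPT:-./import-logs.sh}"

# Download the log, import it into Piwik, then clean up.
# Argument: path where the downloaded log is stored temporarily.
nightly_import() {
  log_file="$1"
  # Stop early if the download fails, so nothing bogus gets imported.
  "$DOWNLOAD_SCRIPT" "$log_file" || { echo "download failed" >&2; return 1; }
  # Keep the log around when the import fails, to allow a manual retry.
  "$IMPORT_SCRIPT" "$log_file" || { echo "import failed, keeping $log_file" >&2; return 1; }
  rm -f "$log_file"
}

# Example call (not executed here):
#   nightly_import /tmp/yesterday.log
```

The explicit `|| return 1` checks keep the failure handling working even when the function is called from an `if` or a `||` chain, where `set -e` is ignored.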
Now that all the scripts are ready, the last steps start with uploading them to the server. Ideally, you want them in a folder that’s not accessible from the internet (or at least with no access to the .*auth files). Make sure they can be executed by running chmod a+x *.sh inside the folder they’re stored in. And finally, in the hosting admin panel, you’ll need to schedule a new job to run every day, using the path to that last script.
Note: You might want to test things out before setting up the Cron job. One catch: unfortunately, it’s a bit more convoluted than SSHing to the server and running the last script from the command line. OVH seems to block network access for scripts on their shared hosting when they are run from an SSH connection. All is not lost, though! To try the script out, you can create a small PHP page that runs the import script and shows its output.
Voila! Your shared hosting logs will get imported into Piwik every night and you can start analysing your traffic. Let me know how this worked for you if you decide to set it up too. And if you have any questions, of course, feel free to contact me.