Introduction
On-premise engines added to a Pipeline can have their data files registered for automatic updates. When a new version of a data file becomes available, it is downloaded and the on-premise engine is refreshed with it. The location from which the data file was loaded can also be monitored for changes, so that when a data file is manually replaced, the on-premise engine is refreshed with it.
This page describes the functionality and options for data updates in the pipeline API. In addition, it is worth being familiar with the Distributor web API, which supplies the updated data files.
See the Specification for more technical details.
Registering for updates
All data files added to an on-premise engine have the option to enable automatic updates. By enabling this, the data file is automatically registered when the on-premise engine is added to a Pipeline.
A data file can also be manually registered for automatic updates by registering it with the data update service. This works in exactly the same way as registration by the pipeline builder, but in most cases it is unnecessary because the pipeline builder does this automatically.
Configuration
There are a number of configuration options available when registering a data file for automatic updates, which specify when and how the data file is updated.
Data update URL
To download a new data file when one becomes available, the data update service must have a URL to download it from. This can be a constant URL, or a URL formatter can be used to dynamically generate the URL based on other options.
License Keys
A License Key may be required when downloading certain types of data files. The data update service uses the License Key, in combination with a URL formatter, to ensure the data file is only made available to licensed users.
File watcher
The location of the data file in the file system can be monitored by enabling the file system watcher. If the data file changes, then the data update service will be called to refresh the on-premise engine using that file. This can be useful when distributing data files to a local cluster.
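The idea behind a file watcher can be illustrated with a minimal polling sketch. This is not the pipeline API's implementation. `FileWatcher` and `has_changed` are hypothetical names, and the sketch assumes that comparing the file's modification time and size is enough to detect a replacement.

```python
import os


class FileWatcher:
    """Hypothetical helper: detects changes to a file by polling its metadata."""

    def __init__(self, path):
        self.path = path
        self._last = self._stamp()

    def _stamp(self):
        # Modification time (in nanoseconds) and size identify a file version.
        st = os.stat(self.path)
        return (st.st_mtime_ns, st.st_size)

    def has_changed(self):
        """Return True once for each detected change since the previous call."""
        current = self._stamp()
        changed = current != self._last
        self._last = current
        return changed
```

A caller would invoke `has_changed()` periodically and trigger an engine refresh whenever it returns True.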
Polling interval
The polling interval tells the data update service how frequently to check for a new data file when the expected date of the next update is not known. If the data file itself provides the date when the next update is expected, then the data update service will not check for updates at all until after that date has passed.
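The scheduling rule described above can be sketched as follows. The function name `next_check_time` and its parameters are assumptions made for illustration and are not part of the pipeline API.

```python
from datetime import datetime, timedelta, timezone


def next_check_time(now, polling_interval, next_update_expected=None):
    """Hypothetical rule for when the next update check should happen."""
    # If the data file states when the next update is expected and that
    # date is still in the future, wait until then.
    if next_update_expected is not None and next_update_expected > now:
        return next_update_expected
    # Otherwise fall back to checking again after the polling interval.
    return now + polling_interval


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(next_check_time(now, timedelta(minutes=30)))
```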
Randomization
In large clusters of servers, it is beneficial to stagger an update. If all servers download a new data file and refresh at the same time, a service's overall performance can be affected. To prevent this, the randomization option enables a random time interval to be added to the time at which the new data file is downloaded. For example, if there are 10 servers, and a full download and refresh takes around 10 seconds, running the updates one after another would take around 100 seconds, so it is sensible to set the randomization to above 100 seconds. Because the delays are random, some overlap is still possible, but only a small number of servers should be updating at any one time.
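The effect of a given randomization window can be explored with a small simulation. This sketch is illustrative only: `max_concurrent` is a hypothetical helper that reports the largest number of servers updating at the same moment for a set of start delays, and the window, server count, and duration below are example values.

```python
import random


def max_concurrent(start_times, duration):
    """Return the peak number of overlapping updates of the given duration."""
    events = []
    for start in start_times:
        events.append((start, 1))              # an update begins
        events.append((start + duration, -1))  # the same update ends
    # At equal times, ends (-1) sort before starts (+1), so updates that
    # run back to back are not counted as overlapping.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Example: 10 servers, each picking a random delay within a 100 second window.
rng = random.Random(42)
delays = [rng.uniform(0, 100) for _ in range(10)]
print(max_concurrent(delays, duration=10))
```

Running the simulation with different window sizes gives a rough feel for how large the randomization value needs to be for a particular cluster.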
URL formatter
Where an on-premise engine needs to download a data file from a URL which is not constant, a URL formatter is used. On-premise engines generally provide the correct URL formatter automatically, but the option to override this is available.
URL formatters are necessary in many cases where multiple data files are available for an on-premise engine. For example, the required format or version of the data file may need to be specified as a parameter in the URL. This is handled by the URL formatter by looking at the current data file to see what is needed.
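As a rough illustration of what a URL formatter does, the sketch below builds a download URL from a License Key and a data file type. Everything here is an assumption: `format_update_url`, the base URL, and the query parameter names (`LicenseKeys`, `Type`, `Download`) are placeholders chosen for the example, so consult the Distributor documentation for the parameters the real service expects.

```python
from urllib.parse import urlencode


def format_update_url(base_url, license_keys, data_type):
    """Hypothetical URL formatter: appends query parameters to a base URL."""
    query = urlencode({
        "LicenseKeys": "|".join(license_keys),  # placeholder parameter name
        "Type": data_type,                      # placeholder parameter name
        "Download": "True",                     # placeholder parameter name
    })
    return f"{base_url}?{query}"


print(format_update_url("https://example.com/download", ["KEY1"], "HashV41"))
```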
Temporary file
It is good practice to set a data file to be copied to a temporary location for use by an on-premise engine. This means that whatever mode the file is being used in (e.g., in memory or streamed from file) an update can occur smoothly.
By setting the on-premise engine to use a temporary file location, the original data file is free to be changed by the data update service. Once the file has been replaced, the on-premise engine will be informed, and it will manage the removal of the temporary file and the creation of a new one.
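The temporary file approach can be sketched in a few lines. `copy_to_temp` is a hypothetical helper, not a pipeline API function. It copies the data file to a uniquely named temporary file so that the original can be replaced while the copy is in use.

```python
import os
import shutil
import tempfile


def copy_to_temp(data_file_path, temp_dir=None):
    """Copy the data file to a unique temporary file and return its path."""
    suffix = os.path.splitext(data_file_path)[1]
    # mkstemp creates the file securely and returns an open descriptor.
    fd, temp_path = tempfile.mkstemp(suffix=suffix, dir=temp_dir)
    os.close(fd)
    shutil.copyfile(data_file_path, temp_path)
    return temp_path
```

The engine would then load from the returned path and delete that file once a newer copy has been created.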
Decompression
Data files are often served as GZipped content from their download URL to minimize the amount of data which needs to be downloaded. When this is the case, the data update service will unzip the data file before carrying on with the process.
Usually an on-premise engine will set this option, along with the URL/URL formatter. But if an alternative URL has been set, then this option may need to be overridden too.
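The decompression step amounts to unzipping the downloaded file before it is used. A minimal sketch using the Python standard library is shown here. `decompress_gzip` is a hypothetical name, and the sketch assumes the download was saved to disk as a GZip file.

```python
import gzip
import shutil


def decompress_gzip(src_path, dest_path):
    """Decompress a GZipped data file to dest_path, streaming in chunks."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
```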
Verify MD5
A server will often provide the MD5 hash of the data file it has served in the Content-MD5 response header. This can then be checked against the hash of the file that was actually downloaded to ensure the integrity of the data file. This option is usually enabled by default, although not all download servers support it.
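The integrity check can be sketched as below. `md5_matches` is a hypothetical helper. RFC 1864 defines the Content-MD5 value as the base64 encoding of the digest, but some servers send a hexadecimal string instead, so this sketch accepts either. Which encoding a particular download server uses is an assumption to verify.

```python
import base64
import hashlib


def md5_matches(data, header_value):
    """Return True if the Content-MD5 header value matches the data's MD5."""
    digest = hashlib.md5(data).digest()
    value = header_value.strip()
    as_hex = digest.hex()
    as_base64 = base64.b64encode(digest).decode("ascii")
    # Hex is case-insensitive; base64 is compared exactly.
    return value.lower() == as_hex or value == as_base64
```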
Verify 'If-Modified-Since'
Unnecessary downloads can be prevented by providing the download server with an If-Modified-Since HTTP header. If this option is enabled (which it is by default for most on-premise engines) the If-Modified-Since header will be set to the date at which the current data file was last modified. If there is not a newer data file on the server then the service will not attempt to download a file.
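On the client side, the check consists of sending the current file's modification date and looking at the status code returned. The sketch below is illustrative: `if_modified_since_header` and `needs_download` are hypothetical names, and it assumes the server answers with the standard HTTP status 304 (Not Modified) when there is nothing newer.

```python
from email.utils import formatdate


def if_modified_since_header(mtime_seconds):
    """Build the request header from a file modification time (epoch seconds)."""
    # usegmt=True produces the HTTP date format, e.g. "... 00:00:00 GMT".
    return {"If-Modified-Since": formatdate(mtime_seconds, usegmt=True)}


def needs_download(status_code):
    """A 304 response means the server has no newer file to offer."""
    return status_code != 304
```

In practice the modification time would come from `os.path.getmtime` on the current data file.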
Recommendations for large clusters
The Distributor API limits the number of requests it will service per day. The limit is enforced per License Key: each Key is allowed a certain number of requests (at most 100, although some Keys have a lower threshold) in each 30-minute period.
Environments that use large numbers of independent nodes can easily exceed this threshold if the automatic update functionality provided by the pipeline API is switched on.
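A back-of-the-envelope estimate shows how quickly the threshold can be reached. The sketch below assumes the worst case, in which every node checks for an update at every polling interval using the same License Key. The function name and the example figures are illustrative, and the limit of 100 is the maximum quoted above.

```python
import math


def estimated_requests_per_window(nodes, polling_interval_minutes, window_minutes=30):
    """Worst-case number of update checks sent within one rate-limit window."""
    checks_per_node = math.ceil(window_minutes / polling_interval_minutes)
    return nodes * checks_per_node


LIMIT = 100  # maximum requests per License Key per 30-minute period
requests = estimated_requests_per_window(nodes=150, polling_interval_minutes=30)
print(requests, "requests per window; exceeds limit:", requests > LIMIT)
```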
Instead, we recommend that a single machine (with secondaries as necessary for redundancy, etc.) is tasked with downloading the data file each day using curl or similar.
There are then several approaches for deploying the new data file within your environment:
Shared network location
The data file can be placed in a shared network location that is visible to many nodes. The pipeline can then be created using this shared file as the data file location.
If the 'File System Watcher' option (described in the sections above) is enabled, the API will watch the file for changes and refresh itself when the file is updated. Where File System Watcher is not supported by the language, the pipeline will need to be re-created using the new data file.
This approach is very simple to implement, but bear in mind that there is no staggering of the update. When the nodes see the new file, they will all attempt to reload from it in a short space of time. This may cause too much infrastructure load.
This could be mitigated by using multiple shared locations that are updated in a staggered way from a single master copy of the data file. Alternatively, one of the approaches below gives you more control over when updates happen.
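One way to stagger updates across several shared locations is sketched below. This is an illustration under stated assumptions rather than a recommended tool: `publish_atomically` and `staggered_publish` are hypothetical helpers, the delay is an example parameter, and the sketch assumes each target directory allows a temporary file to be written alongside the data file.

```python
import os
import shutil
import time


def publish_atomically(master_path, target_path):
    """Copy to a temporary name first, then swap it into place in one step."""
    temp_path = target_path + ".tmp"
    shutil.copyfile(master_path, temp_path)
    # os.replace swaps the file in a single step on the same file system,
    # so readers should never see a partially written data file.
    os.replace(temp_path, target_path)


def staggered_publish(master_path, target_paths, delay_seconds, sleep=time.sleep):
    """Update each shared location in turn, pausing between locations."""
    for index, target in enumerate(target_paths):
        if index > 0:
            sleep(delay_seconds)
        publish_atomically(master_path, target)
```

The pause between locations gives the nodes watching one location time to finish refreshing before the next group starts.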
In addition, we would only recommend using this setup with the 'MaxPerformance' performance option enabled. Other options will stream data from the data file as needed, which will be relatively slow and bandwidth-heavy.
Push data file to nodes
You can use whatever tools are provided by your environment to push the new file out to the nodes in the cluster.
As above, the File System Watcher must be enabled in order for the API to notice the new data file and refresh its internal data structures.
Self-hosted HTTP update
If the data file is hosted and made available on a static URL that is accessible within your environment, the nodes can be configured to check this location for updates, instead of the Distributor service.
A device detection engine can be configured for this scenario through its configuration options, or in code using the relevant engine builder.
The key settings for our purposes are:
- DataUpdateUrl - The static URL to use when checking for a new data file.
- DataUpdateVerifyMd5 - You can either configure your endpoint to also respond with a Content-MD5 HTTP header, or set this to false to prevent the API from trying to verify the content.
- DataUpdateUseUrlFormatter - This must be set to false to prevent the API from appending the query string parameters that are required when calling the 51Degrees Distributor service.
Be aware that the compute nodes will start checking the URL for updates as soon as the 'next published date', which is stored in the data file, is reached. This may be before a new data file is actually available from the static URL.
This means that you will probably need to configure your URL to accept and make use of the If-Modified-Since HTTP header to allow the API to check if it needs an update without downloading the entire data file.
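The server-side decision can be sketched as a single function that a self-hosted endpoint might call before sending the file. `response_status` is a hypothetical name, and the sketch assumes the file's modification time is supplied as a timezone-aware UTC datetime. Many web servers already handle If-Modified-Since for static files, so custom code like this may not be needed.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def response_status(if_modified_since, file_mtime):
    """Return 304 if the client's copy is current, otherwise 200."""
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through and send the file
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop microseconds.
            if file_mtime.replace(microsecond=0) <= since:
                return 304
    return 200


mtime = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(response_status("Mon, 01 Jan 2024 00:00:00 GMT", mtime))
```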
See the Distributor documentation for details of how this works.
