Deploy Linux Container

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. Docker is the chosen container platform to produce container images for Aperture Data Studio. Docker container images are available for both Linux and Windows-based applications. Containerized software will always run the same, regardless of the infrastructure.

Terminology

  • Container image is an unchangeable, static file that includes executable code so it can run an isolated process on information technology (IT) infrastructure. The image comprises the system libraries, system tools and other platform settings a software program needs to run on a containerization platform.
  • Volume is the mechanism used by Docker to persist generated data on the host machine.
  • Port is a communication endpoint. Each container can have applications running on ports. If you want to access the application in the container via a port number, you need to map the port number of the container to the port number of the host machine.
  • Docker Compose is a tool for defining and running multi-container Docker applications with one single command.
  • Podman is an alternative container engine (to Docker) that can run Aperture Data Studio containers.
  • Pod is a group of containers managed by Podman that share the same network and work together.

Overview of Data Studio container ecosystem

Diagram of the topology of the Data Studio Container suite in a Linux machine.

The diagram above illustrates the topology of the Data Studio Container suite in a Linux machine.

Service to port mapping

  • Aperture Data Studio: 7701
  • Aperture Data Studio ODBC: 32527
  • Aperture Find Duplicates: 8080
  • Aperture Find Duplicates Workbench: 8090

Electronic Updates (EU) feeds the latest reference data to Data Studio, where it is used by the Validate addresses step. To use it, you need either a Windows environment to run the provided Windows client, or your own integration built on the REST API.

The following exposed volumes can be mounted to the host machine:

  • ADS folder: Data Studio repository, duplicate store, log files, JDBC drivers, cache, datasetdropzone.
  • Addons folder: storage for SDK custom step .jar files.
  • Address data folder: storage for Validate addresses step reference data (e.g. batchData/GBR).
  • License folder: storage for Data Studio licenses (including licenses for Find duplicates and Validate addresses steps).
  • Server.properties file: Aperture Data Studio configuration file.
  • Qaworld.ini file: Defines the location of the reference data files and how the data is mapped to a country.
  • Qawserve.ini file: Defines the address layouts used by each data set.

Prerequisite

Installation

Deploying Data Studio containers into Docker

  1. Set up files and folders

    Run this step once; it can be skipped for subsequent upgrades. Execute setup.sh to provision, on the host machine, all the volumes required by the Data Studio Docker container. The folders and files are:

    • /opt/ApertureDataStudio/.experian folder: storage for license files.
    • ApertureDataStudio/addons folder: storage for SDK custom step .jar files.
    • ApertureDataStudio/experianmatch folder: storage for find duplicate store files.
    • AddressValidate folder: storage for Validate addresses step reference data (e.g. batchData/GBR).
    • ApertureDataStudio/Server.properties file: Aperture Data Studio configuration file.
    • AddressValidate/Qaworld.ini file: Defines the location of the reference data files and how the data is mapped to a country.
    • AddressValidate/Qawserve.ini file: Defines the address layouts used by each data set.

    setup.sh also creates the Experian user and assigns it to the folders above. The Data Studio Docker container requires the Experian user for read/write access.

    You may modify the location of the volumes in the setup.sh and docker-compose.yml file. For example,

    mkdir -p <new host directory>/addons
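As a rough illustration of relocating the volumes, the sketch below creates the relative folders listed above under a base directory of your choice. It is not the content of setup.sh, which also creates the Experian user and the /opt/ApertureDataStudio/.experian license folder. The function name make_layout and the scratch base directory are assumptions made for this example only.

```shell
#!/bin/sh
# Hypothetical helper: create the relative volume folders under a chosen base directory.
# The folder names come from the list above; this is not the real setup.sh.
make_layout() {
    base=$1
    mkdir -p "$base/ApertureDataStudio/addons" \
             "$base/ApertureDataStudio/experianmatch" \
             "$base/AddressValidate"
}

# Try it against a scratch directory so nothing on the host is touched.
demo_base=$(mktemp -d)
make_layout "$demo_base"
(cd "$demo_base" && find . -type d | sort)
rm -rf "$demo_base"
```

If you use a different base directory, the host side of each volume entry in docker-compose.yml has to point to the same location.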
    
  2. Load images

    Three images make up the Data Studio v2 suite: Data Studio, Find Duplicates, and Find Duplicates Workbench. To load these images, do one of the following:

    • Execute load-images.sh from the download package.
    • Use the docker load command:
    sudo docker load -i datastudio-<VERSION>-docker.tar.gz
    sudo docker load -i find-duplicates-<VERSION>-docker.tar.gz
    sudo docker load -i find-duplicates-workbench-<VERSION>-docker.tar.gz
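The three docker load commands above differ only in the archive name, so they can be driven from a loop. The sketch below is one possible way to do that and is not the content of load-images.sh. The function names and the DRY_RUN variable are assumptions made for this example. With DRY_RUN left at its default the commands are only printed, not executed.

```shell
#!/bin/sh
# Print the three archive names for a version, following the naming in the commands above.
image_archives() {
    for name in datastudio find-duplicates find-duplicates-workbench; do
        printf '%s-%s-docker.tar.gz\n' "$name" "$1"
    done
}

# Load each archive in turn. DRY_RUN defaults to 1 here, so the commands are only printed.
load_images() {
    for archive in $(image_archives "$1"); do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sudo docker load -i $archive"
        else
            sudo docker load -i "$archive" || return 1
        fi
    done
}

# 2.1.0 is only an example version; substitute the one you downloaded.
load_images "2.1.0"
```

Setting DRY_RUN=0 before calling load_images runs the real commands and stops at the first archive that fails to load.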
    
  3. Build, start and stop container with docker-compose

    To start the containers, execute the following command in the folder that contains the docker-compose.yml file.

    sudo docker-compose up -d
    

    The content of docker-compose.yml is shown below, followed by an explanation of each service.

    version: "3"
    services:
      datastudio:
        image: "experian/datastudio:2.0.15.178987"
        ports:
          - "7701:7701"
          - "7801:7801"
        volumes:
          - /opt/ApertureDataStudio/.experian:/opt/ApertureDataStudio/.experian
          - ./ApertureDataStudio:/ApertureDataStudio
          - ./ApertureDataStudio/server.properties:/opt/ApertureDataStudio/server.properties
          - ./ApertureDataStudio/addons:/opt/ApertureDataStudio/addons
          - ./AddressValidate/qawserve.ini:/opt/ApertureDataStudio/addressValidate/runtime/qawserve.ini
          - ./AddressValidate/qaworld.ini:/opt/ApertureDataStudio/addressValidate/runtime/qaworld.ini
          - ./AddressValidate/batchData:/opt/ApertureDataStudio/addressValidate/runtime/batchData
      find-duplicates:
        image: "experian/find-duplicates:2.0.15.178987"
        ports:
          - "8080:8080"
        volumes:
          - /opt/ApertureDataStudio/.experian:/opt/ApertureDataStudio/.experian
          - ./ApertureDataStudio/experianmatch:/home/experian/ApertureDataStudio/data/experianmatch
      find-duplicates-workbench:
        image: "experian/find-duplicates-workbench:2.0.15.178987"
        ports:
          - "8090:8090"
        volumes:
          - ./ApertureDataStudio/experianmatch:/opt/FindDuplicatesWorkbench/experianmatch
    
    1. The datastudio service:

      • Uses the datastudio 2.0.15.178987 image version.
      • Port <host:container> - exposes port 7701 of the container and maps it to port 7701 of the host machine.
      • Volume <host:container> - maps a directory or file on the host machine to a path in the container.
      • Creates 7 volumes to make the data (i.e. database, server configuration, etc.) persistent.
      • To change the path of any of these directories, edit the docker-compose.yml file directly. For example:

        .//addons:/opt/ApertureDataStudio/addons

      • The address validate reference data directory (batchData in the example above) must be created, and the docker-compose.yml file configured, by the user.
      • The number of subdirectories in the address validate reference data directory depends on the number of countries the user subscribes to.
      • Data Studio requires access to local resources to create index files during Workflow execution. Depending on the workflow complexity and the steps used, Data Studio might access or create multiple index files per execution. You can therefore raise the ulimits values so that Data Studio can use more local resources per execution. Set both the soft value (the limit that is enforced) and the hard value (the ceiling up to which the soft limit can be raised). For example:

        ulimits:
          nofile:
            soft: "1048576"
            hard: "1048576"

      • You can also specify the memory usage of the datastudio service through the YAML file. For example:

        command: "java -Xms4G -Xmx32G -jar servicerunner-2.2.2.jar STARTUP"

        The above command starts Data Studio v2.2.2 with a Java heap of between 4 GB and 32 GB.
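To see whether the open-file limit described for the datastudio service is already high enough, the limits of a shell can be inspected with the standard ulimit builtin. The sketch below is only an illustration; the helper name limit_at_least and the threshold variable are assumptions, and the threshold is simply the value used in the ulimits example.

```shell
#!/bin/sh
# Succeeds when a limit reported by `ulimit` is at least the wanted value.
# `ulimit` prints either a number or the word "unlimited".
limit_at_least() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ]
}

wanted=1048576                # value used in the ulimits example
soft=$(ulimit -S -n)          # soft limit on open files for this shell
hard=$(ulimit -H -n)          # hard limit on open files for this shell
echo "nofile soft=$soft hard=$hard"

if limit_at_least "$soft" "$wanted"; then
    echo "soft limit already allows $wanted open files"
else
    echo "soft limit is below $wanted"
fi
```

Run on the host, this reports the limits of the host shell. Running the same commands in a shell inside the container shows the limits the service actually received.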

    2. The find-duplicates service:

      • Uses find-duplicates 2.0.15.178987 image version.
      • Shares the same .experian license volume with datastudio service.
      • You can mount the latest Standardize KnowledgeBase (KB), rather than the default bundled with the Docker image, by adding the following line to docker-compose.yml under the find-duplicates section:

        <host directory which contains the KB>:/opt/Standardize/Data
    3. The find-duplicates-workbench service:

      • Uses find-duplicates-workbench 2.0.15.178987 image version.
      • Shares the same experianmatch volume with find-duplicates service.
    4. You can remove any of the unused services from docker-compose.yml.

  4. [Optional] To stop the containers, execute the following command in the folder that contains the docker-compose.yml file.

    sudo docker-compose down

Upgrading Aperture Data Studio Container Image

To upgrade Data Studio, download the new version of Data Studio (container image and setup files) and repeat steps 2 & 3 in the section above.

The newly downloaded load-images.sh and docker-compose.yml should reflect the new Data Studio version, so no modifications are required.

Installation

  • Install Podman

    For Red Hat Enterprise Linux 8 (RHEL8), use the commands below to get a newer version.

    sudo yum module enable container-tools:rhel8
    sudo yum module install container-tools:rhel8
    

Deploying Data Studio containers into Podman

  1. Load images

    Three images make up the Data Studio v2 suite: Data Studio, Find Duplicates, and Find Duplicates Workbench. To load these images, do one of the following:

    • Execute load-images-podman.sh from the download package.
    • Use the podman load command:
    podman load -i datastudio-<VERSION>-docker.tar.gz
    podman load -i find-duplicates-<VERSION>-docker.tar.gz
    podman load -i find-duplicates-workbench-<VERSION>-docker.tar.gz
    
  2. Set up files and folders

    Run this step once; it can be skipped for subsequent upgrades. Execute setup-podman.sh, which creates all the required files and folders in the current folder.

    chmod +x setup-podman.sh
    ./setup-podman.sh
    
  3. Create pod and containers

    The create-pod-containers.sh script creates a pod named "datastudio-pod" and 3 containers from the loaded container images.

    chmod +x create-pod-containers.sh
    ./create-pod-containers.sh
    

    If the name "datastudio-pod" is already used by an existing pod, remove the existing pod by running the script below:

    chmod +x remove-pod-containers.sh
    ./remove-pod-containers.sh
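Whether the removal script is needed can be decided automatically by asking Podman if the pod already exists. The sketch below assumes the installed Podman version provides the podman pod exists subcommand. The PODMAN override variable and the plan_pod_setup function are hypothetical names introduced here so the logic can be tried without Podman installed. The sketch only prints the scripts to run, in order; it does not run them.

```shell
#!/bin/sh
# PODMAN can be overridden (for example PODMAN=true) to try the logic without Podman.
PODMAN="${PODMAN:-podman}"
POD_NAME="datastudio-pod"

# Print the scripts that need to run, in order, depending on whether the pod exists.
plan_pod_setup() {
    if "$PODMAN" pod exists "$POD_NAME" >/dev/null 2>&1; then
        echo "./remove-pod-containers.sh"
    fi
    echo "./create-pod-containers.sh"
}

plan_pod_setup
```

If Podman is not installed, the existence check fails and only the create script is listed.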
    
  4. Manage pod

    • List all the available pods.
      podman pod ls
    • Start the pod and all the containers of the pod.
      podman pod start datastudio-pod
    • Stop the pod and all the containers of the pod.
      podman pod stop datastudio-pod

Upgrading Aperture Data Studio Container Image

To upgrade Data Studio, download the new version of Data Studio (container image and setup files) and repeat steps 1 & 3 in the section above.

The newly downloaded load-images-podman.sh and create-pod-containers.sh should reflect the new Data Studio version so no modifications are required.

Limitations (Docker for Aperture Data Studio)

  • If you change the REST web server TCP/IP port, update it in both the application and the docker-compose.yml file (modify the datastudio service -> ports section).
  • An ODBC driver is not provided for installation on Linux. However, datasets in the Linux version of Aperture Data Studio can still be accessed from another machine that has the Windows ODBC driver installed and connects to the Linux machine through a Windows-based application such as Microsoft Excel.
  • Find Duplicates - the Find Duplicates server is deployed in a separate container, so the configuration needs to be similar to the example below, where find-duplicates is the container name. We recommend allocating 32GB of RAM to the Aperture Find Duplicates container so that it can run without impacting performance.
  • Docker for Aperture Data Studio is currently only supported on a host machine (physical or virtual machine). Container orchestration platforms such as Azure Container Instance (ACI), Azure Kubernetes Service (AKS), or OpenShift Kubernetes are unsupported.

Example of Workflow steps for the Find Duplicates server where find-duplicates is the container name.

Secure Sockets Layer (SSL) for Aperture Data Studio containers in Docker/Podman

Secure Sockets Layer (SSL) can be enabled for Data Studio containers running on Docker or Podman.

To use SSL, you need a certificate. First, get the required certificate and key files. If you don't have them available, use the command below to create two sample files ('example.crt' and 'example.key'):

openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout example.key -out example.crt -subj "/CN=example.com" \
  -addext "subjectAltName=DNS:example.com,DNS:example.net,IP:10.0.1.4"
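Before pointing Data Studio at a certificate, it can be worth checking that the key belongs to the certificate and that the certificate is not about to expire. The sketch below uses standard openssl subcommands for this. The function names are assumptions made for the example, and the key comparison only applies to RSA keys such as the one generated above.

```shell
#!/bin/sh
# Succeeds when the certificate will still be valid after the given number of seconds.
cert_valid_for() {
    openssl x509 -in "$1" -noout -checkend "$2" >/dev/null
}

# Succeeds when an RSA private key belongs to the certificate (their moduli are identical).
cert_matches_key() {
    cert_mod=$(openssl x509 -in "$1" -noout -modulus) || return 1
    key_mod=$(openssl rsa -in "$2" -noout -modulus 2>/dev/null) || return 1
    [ "$cert_mod" = "$key_mod" ]
}

# Inspect the sample files if they are present in the current directory.
if [ -f example.crt ] && [ -f example.key ]; then
    openssl x509 -in example.crt -noout -subject -enddate
    if cert_matches_key example.crt example.key; then
        echo "key matches certificate"
    else
        echo "key does NOT match certificate"
    fi
    if cert_valid_for example.crt 86400; then
        echo "valid for at least one more day"
    else
        echo "expires within one day"
    fi
else
    echo "example.crt / example.key not found; generate them first"
fi
```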

To import a certificate:

  1. Place the certificate files (.pem/.crt/.key) in the ApertureDataStudio/cert folder. A certificate and a key file should already exist in that folder.
  2. (Podman only) Update the ownership of the certificate files to 49082:49082 using the command below:
    chown -R 49082:49082 ApertureDataStudio
  3. Start the Aperture Data Studio container.
  4. Access Data Studio through the browser using the HTTP protocol.
  5. In Data Studio navigate to Settings > Communication.
  6. Enable Use Secure Sockets Layer (SSL).
  7. Under Key file enter the "cert" folder name followed by a forward slash and the key file name (e.g. cert/example.key).
  8. Repeat the process for the Certificate file. E.g. cert/example.crt.
  9. (Optional) If required, enter a passphrase used for encryption in Key passphrase.
  10. Click Save.
  11. Restart the Aperture Data Studio container. SSL is now enabled. Data Studio can now be accessed using the HTTPS protocol through the browser.
    If you have your own cacerts file, the following changes are required to the docker-compose file:

    environment:
      JVM_OPTS: >
        -Djavax.net.ssl.trustStore=/opt/ApertureDataStudio/cert/cacerts

NGINX as Reverse Proxy for SSL Termination (Docker only)

We recommend using a Docker NGINX container alongside Data Studio if there's a need for a reverse proxy for SSL termination.

All Data Studio ports are fixed in order to run in a Docker container environment.

As such, mapping to different ports through configuring the docker-compose.yml or nginx.conf files will not work.

To use the standard SSL port 443, configure it in the server.properties file by adding the following lines:

Server.httpPort=443
Communication.useSSL=true
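If server.properties already contains one of these keys, appending a second copy would leave two conflicting entries. The sketch below shows one way to replace an existing key or append a missing one. The set_property helper is hypothetical, it assumes simple key=value lines without spaces around the equals sign, and the starting content of the scratch file is invented for the demonstration.

```shell
#!/bin/sh
# Hypothetical helper: set KEY=VALUE in a properties file, replacing an existing
# KEY= line or appending a new one.
set_property() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    touch "$file"
    awk -v k="$key" -v v="$value" '
        BEGIN { FS = "=" }
        $1 == k { print k "=" v; found = 1; next }
        { print }
        END { if (!found) print k "=" v }
    ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back in place (rather than mv) so a bind-mounted file keeps its inode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demonstrate on a scratch file; the starting content is assumed for the demo.
props=$(mktemp)
printf 'Server.httpPort=7701\n' > "$props"
set_property "$props" Server.httpPort 443
set_property "$props" Communication.useSSL true
cat "$props"
rm -f "$props"
```

Writing the result back with cat instead of moving the temporary file over the original matters when the file is mounted into a container on its own, because a single-file mount follows the original file rather than its name.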

Create certificates

A certificate is required in order to use SSL. The command below will create 2 files: example.crt and example.key. These files will need to be provided to the NGINX container later.

openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout example.key -out example.crt -subj "/CN=example.com" \
  -addext "subjectAltName=DNS:example.com,DNS:example.net,IP:10.0.1.4"

Add NGINX container

  1. Create a nginx.conf file with the content below:

    server {
        listen 443 ssl;
        ssl_certificate /etc/nginx/certs/example.crt;
        ssl_certificate_key /etc/nginx/certs/example.key;
        location / {
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_pass https://datastudio:443/;
        }
    }
    
  2. Update the docker-compose.yml file to include the NGINX container.

  3. Mount the folder that contains the example.crt and example.key files created earlier to "/etc/nginx/certs".

  4. Mount the nginx.conf created above to "/etc/nginx/conf.d/nginx.conf".

    version: '3'
    services:
      datastudio:
        image: "experian/datastudio:2.1.0"
        ports:
          - "8443:443"
        volumes:
          - /opt/ApertureDataStudio/.experian:/opt/ApertureDataStudio/.experian
          - ./ApertureDataStudio:/ApertureDataStudio
          - ./ApertureDataStudio/server.properties:/opt/ApertureDataStudio/server.properties
          - ./ApertureDataStudio/addons:/opt/ApertureDataStudio/addons
          - ./AddressValidate/qawserve.ini:/opt/ApertureDataStudio/addressValidate/runtime/qawserve.ini
          - ./AddressValidate/qaworld.ini:/opt/ApertureDataStudio/addressValidate/runtime/qaworld.ini
          - ./AddressValidate/batchData:/opt/ApertureDataStudio/addressValidate/batchData
      nginx:
        image: "nginx:1.19.0"
        ports:
          - "443:443"
        volumes:
          - ./Nginx/certs:/etc/nginx/certs
          - ./Nginx/nginx.conf:/etc/nginx/conf.d/nginx.conf
        environment:
          no_proxy: datastudio
    

    Once the docker-compose up command is run, access with SSL will be enabled.

Modify log4j2 logging level in containers

  1. Download log4j2.xml and place it in the ApertureDataStudio folder on the host machine, the same folder in which the server.properties file is located.

  2. Modify the logging level through the downloaded file and mount it to "/opt/ApertureDataStudio/log4j2.xml".

    version: '3'
    services:
      datastudio:
        image: "experian/datastudio:2.1.0"
        ports:
          - "8443:443"
        volumes:
          - /opt/ApertureDataStudio/.experian:/opt/ApertureDataStudio/.experian
          - ./ApertureDataStudio:/ApertureDataStudio
          - ./ApertureDataStudio/server.properties:/opt/ApertureDataStudio/server.properties
          - ./ApertureDataStudio/log4j2.xml:/opt/ApertureDataStudio/log4j2.xml
          - ./ApertureDataStudio/addons:/opt/ApertureDataStudio/addons
          - ./AddressValidate/qawserve.ini:/opt/ApertureDataStudio/addressValidate/runtime/qawserve.ini
          - ./AddressValidate/qaworld.ini:/opt/ApertureDataStudio/addressValidate/runtime/qaworld.ini
          - ./AddressValidate/batchData:/opt/ApertureDataStudio/addressValidate/batchData
    

Modify memory allocation pool for JVM in containers

  1. For the Docker container, add the following command section in docker-compose.yml file. Modify -Xms (minimum heap size which is allocated at initialization of JVM) and -Xmx (maximum heap size that JVM can use) to the intended value.

    version: '3'
    services:
      datastudio:
        image: "experian/datastudio:2.1.0"
        command: java -XX:+UseG1GC -XX:+UseStringDeduplication -Xms16G -Xmx64G -jar servicerunner-2.1.0.jar "STARTUP"
        ports:
          - "8443:443"
        volumes:
          - /opt/ApertureDataStudio/.experian:/opt/ApertureDataStudio/.experian
          - ./ApertureDataStudio:/ApertureDataStudio
          - ./ApertureDataStudio/server.properties:/opt/ApertureDataStudio/server.properties
          - ./ApertureDataStudio/log4j2.xml:/opt/ApertureDataStudio/log4j2.xml
          - ./ApertureDataStudio/addons:/opt/ApertureDataStudio/addons
          - ./AddressValidate/qawserve.ini:/opt/ApertureDataStudio/addressValidate/runtime/qawserve.ini
          - ./AddressValidate/qaworld.ini:/opt/ApertureDataStudio/addressValidate/runtime/qaworld.ini
          - ./AddressValidate/batchData:/opt/ApertureDataStudio/addressValidate/batchData
    
  2. For Podman, append the java entry point to the end of the podman run command:

    podman run --name datastudio localhost/experian/datastudio:2.1.0 java -XX:+UseG1GC -XX:+UseStringDeduplication -Xms16G -Xmx64G -jar servicerunner-2.1.0.jar "STARTUP"
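A JVM will not start when the initial heap (-Xms) is larger than the maximum heap (-Xmx), so it can help to validate the two values before editing the command. The sketch below builds the heap flags from whole-gigabyte values and assembles the java command shown above. The heap_flags helper is hypothetical, and the version string is taken from the example rather than from your installation.

```shell
#!/bin/sh
# Hypothetical helper: build the heap flags from whole-GB values and refuse
# combinations the JVM would reject (initial heap larger than maximum heap).
heap_flags() {
    min_gb=$1
    max_gb=$2
    for n in "$min_gb" "$max_gb"; do
        case "$n" in
            ''|*[!0-9]*) echo "heap sizes must be whole numbers of GB" >&2; return 1 ;;
        esac
    done
    if [ "$min_gb" -gt "$max_gb" ]; then
        echo "-Xms (${min_gb}G) must not exceed -Xmx (${max_gb}G)" >&2
        return 1
    fi
    printf '%s\n' "-Xms${min_gb}G -Xmx${max_gb}G"
}

# Assemble the command from the example; the version is taken from that example.
version="2.1.0"
echo "java -XX:+UseG1GC -XX:+UseStringDeduplication $(heap_flags 16 64) -jar servicerunner-${version}.jar STARTUP"
```

The printed line can be pasted into the command section of docker-compose.yml or appended to the podman run command.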

Reset the super-admin user's password

From the Linux machine:

  1. Remote into the container by executing sudo docker exec -it ads_datastudio_1 /bin/bash, where ads_datastudio_1 refers to the provisioned Data Studio container.
  2. In the container, run the command java -jar servicerunner-[version].jar STARTUP RESETADMINPASSWORD, where the version is the current Data Studio version.
  3. Enter the new password and save changes.
  4. Exit from the container by using the exit command.
  5. Restart the containers using the docker-compose command.

Configure the Validate addresses step

  1. Get the latest country data files from the Experian Electronic Updates portal.
  2. In QAWSERVE.ini, edit the following:
    • For the InstalledData property, make sure that the batch data refers to the path in the container (e.g., /opt/ApertureDataStudio/addressValidate/batchData).
    • Configure the country to be used for the DataMapping property field (e.g., GBR,Great Britain,GBR).
  3. In server.properties, add Server.AddressValidateInstallPath=/container path/addressValidate/runtime.

Find out more about configuring address validation in Data Studio.

Test the Validate addresses step

To confirm that the step was configured correctly:

  1. Remote into the container using the following command:
    sudo docker exec -it <container name> /bin/bash
  2. Once in, change the directory to the runtime folder.
  3. Execute the following command:
    LD_LIBRARY_PATH=. ./batwv
  4. Enter the number associated with the loaded data and check that the configuration is successful.

JDBC drivers setup

Data Studio supports connecting to external systems, which can be used either as data sources or as the target of an export operation. One of the supported systems is connectivity to a DBMS via JDBC. To use it, you first need to set up the JDBC drivers by following the instructions below.

  1. Download the drivers from the Community portal.
  2. Unzip the files and put them into the jdbc folder.
  3. If the jdbc folder does not exist, create it and mount it through the volumes section in docker-compose.yml as below:

    [the jdbc folder in host machine]:/opt/ApertureDataStudio/drivers/jdbc
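To confirm which driver files actually ended up in the host folder before restarting the container, the folder can be listed. The sketch below assumes the driver files keep the DS*.jar naming used in the driver table in this section. The function name and the scratch folder with empty placeholder files are made up for the demonstration.

```shell
#!/bin/sh
# List the driver .jar files found in a folder (names following the DS*.jar pattern).
list_jdbc_drivers() {
    for jar in "$1"/DS*.jar; do
        if [ -f "$jar" ]; then
            basename "$jar"
        fi
    done
}

# Demonstrate with a scratch folder holding two empty placeholder files.
demo=$(mktemp -d)
: > "$demo/DSmysql.jar"
: > "$demo/DSoracle.jar"
list_jdbc_drivers "$demo"
rm -rf "$demo"
```

Pointing list_jdbc_drivers at the real jdbc folder on the host prints one file name per line, or nothing if no driver has been unzipped there yet.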

Once done (and with the appropriate licensing add-on), you will have access to the following JDBC drivers.

DBMS name                                           Driver file name
Amazon Redshift                                     DSredshift.jar
Apache Hive                                         DShive.jar
  • Microsoft Azure HDInsight
  • Hortonworks Distribution for Apache Hadoop
  • Cloudera's Distribution Including Apache Hadoop (CDH)
  • Amazon Elastic MapReduce (Amazon EMR)
  • IBM BigInsights
  • MapR Distribution for Apache Hadoop
  • Pivotal HD Enterprise (PHD)
Autonomous Rest Connector (REST API data sources)   DSautorest.jar
Cassandra                                           DScassandra.jar
DB2                                                 DSdb2.jar
Google BigQuery                                     DSgooglebigquery.jar
Greenplum                                           DSgreenplum.jar
  • Pivotal Greenplum
  • Pivotal HDB (HAWQ)
Informix                                            DSinformix.jar
MongoDB                                             DSmongodb.jar
Microsoft Dynamics 365                              DSdynamics365.jar
MySQL                                               DSmysql.jar
Oracle                                              DSoracle.jar
Oracle Eloqua                                       DSeloqua.jar
Oracle Sales Cloud                                  DSoraclesalescloud.jar
Oracle Service Cloud                                DSrightnow.jar
PostgreSQL                                          DSpostgresql.jar
Progress OpenEdge                                   DSopenedgewp.jar
Salesforce                                          DSsforce.jar
  • Salesforce.com
  • Veeva CRM
  • Force.com Applications
  • Financial Force
Snowflake                                           DSsnowflake.jar
Spark SQL                                           DSsparksql.jar
SQL Server                                          DSsqlserver.jar
  • Microsoft SQL Server
  • Microsoft SQL Azure
Sybase                                              DSsybase.jar

Full documentation for these drivers can be found on the Progress DataDirect Connectors site, under the relevant source.