Q&A

    Introduction 

    This document provides an overview of answers to questions that our clients typically have regarding MPC and the Roseman Labs Virtual Data Lake. It should be used as a reference document. For follow-up questions, please do not hesitate to contact your Roseman Labs contact person or support@rosemanlabs.com. Below you will find answers to the following questions:

    1. What is MPC?
    2. What is the Virtual Data Lake?
    3. What type of computations can be done on data in the VDL?
    4. How can undesired queries be prevented?
    5. How can you avoid statistical disclosure?
    6. Who can run the queries and who sees the results, are these the same?
    7. What order of magnitude performance can we expect?
    8. Is there a minimum or a maximum number of parties that needs to or can be involved?
    9. Where does the data reside, does it sit in the Virtual Data Lake?
    10. How can the data be kept secret from different parties?
    11. Are there cryptographic keys involved and who holds these?
    12. Who controls the data?
    13. How secure is the Virtual Data Lake?
    14. How can input data quality be ensured if it is not visible?
    15. Can model training be done in the Virtual Data Lake?
    16. I'm getting a CERTIFICATE_VERIFY_FAILED error while using crandas with my local VDL cluster. What should I do?

    What is MPC?

    Multi-Party Computation is a “Privacy Enhancing Technology” that allows multiple parties to create insights across multiple data sources without disclosing the underlying data. Averages, comparisons and more complex calculations can be performed across multiple databases while the underlying numbers remain secret. The types of calculations/queries that can be performed on the data are restricted to those agreed upfront, and the outcomes are shared only with pre-agreed parties.

    MPC is based on mathematical protocols first developed in the 1970s. Over the last 40 years, new protocols have been developed for more complex calculations, and in recent years, thanks to increased computing power and further mathematical development, MPC has become market ready.

    The essence of MPC is that data is encrypted and partitioned into so-called secret shares at the source. An individual secret share reveals nothing about the underlying data (it is literally just one piece of the puzzle). The secret shares are distributed over 3 or more servers that jointly act as the privacy engine. Together, these servers can execute calculations on the secret shares. Once the calculations are complete, the privacy engine decrypts only the end result, and only for the pre-agreed party.
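    The secret-sharing idea described above can be sketched in a few lines of Python. The snippet below shows generic additive secret sharing over a prime field; it is a textbook illustration of the concept, not the exact protocol Roseman Labs uses.

```python
import secrets

# Illustrative additive secret sharing over a prime field.
# This is a generic textbook construction, NOT the exact VDL protocol.
P = 2**61 - 1  # a Mersenne prime used as the field modulus

def share(value, n=3):
    """Split `value` into n random shares that sum to `value` mod P."""
    shares = [secrets.randbelow(P) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def reconstruct(shares):
    """Recombine all shares; any strict subset reveals nothing."""
    return sum(shares) % P

salary = 52_000
assert reconstruct(share(salary)) == salary

# Addition can be done share-wise, without ever reconstructing the inputs:
a, b = 10, 32
sum_shares = [(x + y) % P for x, y in zip(share(a), share(b))]
assert reconstruct(sum_shares) == a + b  # 42
```

    Note that each individual share is a uniformly random field element, which is why holding fewer than all shares reveals nothing about the input.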

    What is the Virtual Data Lake?

    The Roseman Labs Virtual Data Lake (VDL) is production-grade software that puts MPC into practice. The VDL addresses the two key obstacles to deploying MPC in practice: MPC expertise and performance.

    MPC expertise is scarce and resides with a small group of specialized cryptographers; few companies have this expertise in house. Building and maintaining MPC applications would otherwise require hiring this expertise on an ongoing basis. The VDL embeds the MPC technology and also includes an interface in which data scientists can work. The Python scripts they write are translated by the VDL into the MPC environment, so no specific MPC knowledge is required to create applications and models in the VDL.
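    To illustrate the style of script a data scientist writes, the sketch below uses plain pandas. In the VDL, the crandas library offers a similar pandas-like interface while the actual computation runs on secret-shared data; all table and column names here are invented for the example.

```python
import pandas as pd

# Illustrative only: plain pandas showing the style of analysis script.
# In the VDL, crandas exposes a similar pandas-like API while the
# computation runs on secret-shared tables.
hospital_a = pd.DataFrame({"patient_id": [1, 2, 3], "age": [34, 51, 47]})
hospital_b = pd.DataFrame({"patient_id": [2, 3, 4], "readmitted": [True, False, True]})

# Join the two sources on a shared key and compute an aggregate;
# in the VDL only this aggregate would be revealed, never the rows.
joined = hospital_a.merge(hospital_b, on="patient_id")
mean_age_readmitted = joined.loc[joined["readmitted"], "age"].mean()
print(mean_age_readmitted)  # 51.0
```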

    Like any encryption technology MPC requires more compute power than executing the same calculations without encryption. Also, MPC requires sufficient bandwidth and low latency between the servers that are part of the privacy engine. The VDL has been developed with practical performance targets in mind. Roseman Labs MPC software is configured in such a way that the available compute power of the servers is maximally utilized, while communication between the servers is minimized. 

    What type of computations can be done on data in the VDL?

    All common types of computations can be performed in the VDL, including more complex calculations such as regressions, decision trees and random forests. Computations require adaptation to MPC to ensure performance; this is the specialty of the Roseman Labs cryptography team. Very advanced models such as deep neural networks are not yet possible in the VDL, but this is a very active and maturing area of scientific research.
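    As an illustration of one such computation type, the sketch below fits an ordinary least-squares regression with plain NumPy; in the VDL the equivalent computation would run on secret-shared columns, but the analytical result is the same kind of output.

```python
import numpy as np

# Plain-NumPy analogue of a computation type the VDL supports:
# an ordinary least-squares regression on synthetic data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(0, 0.1, size=100)

# Fit y = a*x + b
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(round(a, 2), round(b, 2))  # close to 2.5 and 1.0
```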

    How can undesired queries be prevented?

    Every query or type of query needs to be signed off by all parties providing the data. Typically, a steering committee approves the (types of) queries that are allowed. This could, for instance, include a minimum level of aggregation of the end results (i.e., a minimum number of records in an aggregate). In a more stringent set-up, each individual query is signed off by each of the partners (ex-ante); in a more flexible set-up, all queries are logged and reviewed afterwards (ex-post).
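    A minimum-aggregation rule like the one mentioned above could be sketched as follows. The function name and threshold are hypothetical; in practice such rules are encoded in the approved analysis script itself.

```python
# Hypothetical sketch of a "minimum aggregation" release rule, as a
# steering committee might specify it: an aggregate is released only
# if it covers at least MIN_GROUP_SIZE records.
MIN_GROUP_SIZE = 10

def release_mean(values, k=MIN_GROUP_SIZE):
    """Return the mean only if it aggregates at least k records."""
    if len(values) < k:
        raise PermissionError(f"aggregate covers {len(values)} records; minimum is {k}")
    return sum(values) / len(values)

ages = [34, 51, 47, 29, 62, 41, 55, 38, 46, 50]
print(release_mean(ages))        # allowed: 10 records

try:
    release_mean(ages[:4])       # refused: only 4 records
except PermissionError as e:
    print("refused:", e)
```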

    How can you avoid statistical disclosure?

    All queries performed on the VDL are logged. To assess the possibility of statistical disclosure, the query log can be reviewed. The VDL does not yet assess or mitigate the risk of statistical disclosure automatically; this may become available in the future.

    Who can run the queries and who sees the results, are these the same?

    The results can be presented either to the party executing the query or (partially) to other parties. For instance, the Virtual Data Lake could produce different partial outputs for different parties. Practically, the inputs and outputs are specified in the (Python) analysis script that is signed off by the steering committee/trustees.

    What order of magnitude performance can we expect?

    Performance depends on a number of parameters, including the complexity of the computation, the size of the data set, and the processing power of the servers that jointly form the privacy engine. Indicative numbers are provided in the table below:

    Is there a minimum or a maximum number of parties that needs to or can be involved?

    No. The number of servers in the privacy engine is independent of the number of parties. Two parties cooperating on two data sets would still need 3 or more servers; if a third server is needed, Roseman Labs can provide one. Those servers can reside anywhere, as long as the administration rights to the servers are segregated. Five cooperating parties could choose a set-up with 3 servers, or a set-up with 5 servers in which each party controls one server.

    Where does the data reside, does it sit in the Virtual Data Lake?

    No. The real data resides only in the source database. The privacy engine holds only the secret shares, divided over 3 or more servers. The secret shares are strong encryptions of the input, and individual secret shares do not reveal any information. Only someone with access to a majority of the shares (or all of them, depending on the MPC protocol) would be able to reconstruct the source data.

    How can the data be kept secret from different parties?

    The individual secret shares do not reveal any information. Because the secret shares are divided over 3 or more servers, none of the parties operating a server has access to the source information. Depending on the protocol, a majority or all of the parties operating a server would need to collude in order to combine the secret shares and reveal the underlying data.

    Are there cryptographic keys involved and who holds these?

    Yes. Keys are involved in script signing, as well as in encryption schemes such as AES. For more information, see the Roseman Labs documentation.


    Who controls the data?

    The data owner retains full control over the data. The owner needs to approve computations that are performed on the data, or can review them retrospectively. If the owner no longer wants to participate in the partnership, it simply withdraws its approval for any query. Because the data is never copied to any other place, it cannot be misused or retained after the data owner withdraws approval.

    How secure is the Virtual Data Lake?

    As long as the parties operating the servers that jointly form the privacy engine do not collude, it is mathematically proven that no data can be revealed. Depending on the chosen protocol, a majority or all of the parties would need to collude to reveal the source data.

    How can input data quality be ensured if it is not visible?

    There are various methods available to ensure data quality:

    • Before encryption and partitioning into secret shares, the data can be tested on different dimensions. This happens on the source data, within the environment of the data owner.
    • The VDL user can query a small random/representative sample of the data (of a pre-agreed size and format, for instance always leaving out sensitive data components like names).
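    The first method, testing the source data before secret sharing, might look like the following sketch. All field names and thresholds are invented for illustration; the data owner would run such checks on the plaintext source, before anything enters the VDL.

```python
# Illustrative pre-upload quality checks run by the data owner on the
# plaintext source data, before it is secret-shared into the VDL.
# Field names and the age range are invented for this example.
def quality_report(records, required_fields=("patient_id", "age")):
    """List (index, reason) pairs for records that fail basic checks."""
    issues = []
    for i, rec in enumerate(records):
        if any(f not in rec or rec[f] is None for f in required_fields):
            issues.append((i, "missing field"))
        elif not (0 <= rec["age"] <= 120):
            issues.append((i, "age out of range"))
    return issues

data = [
    {"patient_id": 1, "age": 34},
    {"patient_id": 2, "age": 150},   # implausible value
    {"patient_id": 3, "age": None},  # incomplete record
]
print(quality_report(data))  # → [(1, 'age out of range'), (2, 'missing field')]
```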

    Can model training be done in the Virtual Data Lake?

    In principle, yes, although there are three considerations why it may be preferable to do model training “in the clear”, outside the Virtual Data Lake:

    • When developing and training a model the data scientist typically wants to see the data, which is not possible in the VDL.
    • The VDL is less performant than a database in the clear, which means the server would require more time to test and train the model.
    • Model development and training can typically be done on a (historic) sub-set of data, which can also be “cleaned” for this specific purpose. 

    I'm getting a CERTIFICATE_VERIFY_FAILED error while using crandas. What should I do?

    If you encounter a `CERTIFICATE_VERIFY_FAILED` error, first make sure that you are connecting to the correct port: connect to the crandas TLS port, not the node-to-node TLS port. Another way to spot this issue is that node0 will complain about an incorrect TLS connection, because crandas presents a different TLS certificate than the Virtual Data Lake nodes use among themselves.