Hierarchy generation using PyARXaaS

[1]:
from pyarxaas import ARXaaS
from pyarxaas.privacy_models import KAnonymity, LDiversityDistinct
from pyarxaas import AttributeType
from pyarxaas import Dataset
from pyarxaas.hierarchy import IntervalHierarchyBuilder, RedactionHierarchyBuilder, OrderHierarchyBuilder, DateHierarchyBuilder
import pandas as pd

Create connection to ARXaaS

[2]:
arxaas = ARXaaS("http://localhost:8080")

Fetch data

[3]:
data_df = pd.read_csv("../data/data2.csv", sep=";")
[4]:
data_df
[4]:
   zipcode  age  salary         disease
0    47677   29       3   gastric ulcer
1    47602   22       4       gastritis
2    47678   27       5  stomach cancer
3    47905   43       6       gastritis
4    47909   52      11             flu
5    47906   47       8      bronchitis
6    47605   30       7      bronchitis
7    47673   36       9       pneumonia
8    47607   32      10  stomach cancer

Create Redaction based hierarchy

Redaction-based hierarchies are best suited to values that are categorical but written with digits. Attributes such as zip codes are prime candidates. Each level of the hierarchy redacts one more character of the value, and during anonymization the levels are applied until the privacy model criteria are met. The hierarchy builder can be configured to start redacting from either direction, but defaults to RIGHT_TO_LEFT. Redaction hierarchies take the least effort to create.

1. Extract column to create hierarchy from

[5]:
zipcodes = data_df["zipcode"].tolist()
zipcodes
[5]:
[47677, 47602, 47678, 47905, 47909, 47906, 47605, 47673, 47607]

2. Create hierarchy builder to use

Here we are specifying a character to use and the order the redaction should follow.

[6]:
redaction_based = RedactionHierarchyBuilder(redaction_char="♥",
                                            redaction_order=RedactionHierarchyBuilder.Order.LEFT_TO_RIGHT)

3. Call the ARXaaS service to create the hierarchy

[7]:
redaction_hierarchy = arxaas.hierarchy(redaction_based, zipcodes)
[8]:
redaction_hierarchy
[8]:
[['47677', '♥7677', '♥♥677', '♥♥♥77', '♥♥♥♥7', '♥♥♥♥♥'],
 ['47602', '♥7602', '♥♥602', '♥♥♥02', '♥♥♥♥2', '♥♥♥♥♥'],
 ['47678', '♥7678', '♥♥678', '♥♥♥78', '♥♥♥♥8', '♥♥♥♥♥'],
 ['47905', '♥7905', '♥♥905', '♥♥♥05', '♥♥♥♥5', '♥♥♥♥♥'],
 ['47909', '♥7909', '♥♥909', '♥♥♥09', '♥♥♥♥9', '♥♥♥♥♥'],
 ['47906', '♥7906', '♥♥906', '♥♥♥06', '♥♥♥♥6', '♥♥♥♥♥'],
 ['47605', '♥7605', '♥♥605', '♥♥♥05', '♥♥♥♥5', '♥♥♥♥♥'],
 ['47673', '♥7673', '♥♥673', '♥♥♥73', '♥♥♥♥3', '♥♥♥♥♥'],
 ['47607', '♥7607', '♥♥607', '♥♥♥07', '♥♥♥♥7', '♥♥♥♥♥']]

Redaction hierarchy without configuration

[9]:
no_config_redaction_based = RedactionHierarchyBuilder() # Create builder
redaction_hierarchy = arxaas.hierarchy(no_config_redaction_based, zipcodes) # pass builder and column to arxaas
redaction_hierarchy
[9]:
[['47677', '4767*', '476**', '47***', '4****', '*****'],
 ['47602', '4760*', '476**', '47***', '4****', '*****'],
 ['47678', '4767*', '476**', '47***', '4****', '*****'],
 ['47905', '4790*', '479**', '47***', '4****', '*****'],
 ['47909', '4790*', '479**', '47***', '4****', '*****'],
 ['47906', '4790*', '479**', '47***', '4****', '*****'],
 ['47605', '4760*', '476**', '47***', '4****', '*****'],
 ['47673', '4767*', '476**', '47***', '4****', '*****'],
 ['47607', '4760*', '476**', '47***', '4****', '*****']]
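
To make the redaction strategy concrete, the following is a minimal local sketch of how such a hierarchy row could be built. It is not the PyARXaaS or ARXaaS implementation: `redaction_levels` is a hypothetical helper, and it assumes all values have the same length, so it does no padding.

```python
def redaction_levels(value, redaction_char="*", right_to_left=True):
    """Return all levels for one value, from the original to fully redacted."""
    text = str(value)
    levels = []
    for n in range(len(text) + 1):
        if right_to_left:
            # Mask the last n characters.
            levels.append(text[:len(text) - n] + redaction_char * n)
        else:
            # Mask the first n characters.
            levels.append(redaction_char * n + text[n:])
    return levels


# Hypothetical usage with a few of the zip codes from the data set.
for zipcode in [47677, 47602, 47678]:
    print(redaction_levels(zipcode))
```

Called with the defaults, the helper reproduces the rows of the unconfigured hierarchy above. Passing `redaction_char="♥"` and `right_to_left=False` reproduces the rows of the configured one.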

Create interval based hierarchy

Interval-based hierarchies are well suited to continuous numeric values. Attributes such as age, income or credit score are typically generalized with an interval hierarchy. The interval hierarchy builder requires the user to specify the intervals into which the attribute's values are generalized. Optionally, these intervals can be labeled. In addition, intervals can be grouped upwards using levels and groups to create a deeper hierarchy.

1. Extract column to create hierarchy from

[10]:
column = data_df["age"].tolist()
column
[10]:
[29, 22, 27, 43, 52, 47, 30, 36, 32]

2. Create hierarchy builder to use

[11]:
interval_based = IntervalHierarchyBuilder()

3. Add intervals to the builder. The intervals must be continuous (without gaps).

[12]:
interval_based.add_interval(0,18, "child")
interval_based.add_interval(18,30, "young-adult")
interval_based.add_interval(30,60, "adult")
interval_based.add_interval(60,120, "old")

4. (Optional) Add groupings. Groupings are added to a specific level and follow the order in which the intervals were added: each group takes the given number of consecutive intervals.

[13]:
interval_based.level(0)\
    .add_group(2, "young")\
    .add_group(2, "adult");

5. Call the ARXaaS service to create the hierarchy

[14]:
interval_hierarchy = arxaas.hierarchy(interval_based, column)
[15]:
interval_hierarchy
[15]:
[['29', 'young-adult', 'young', '*'],
 ['22', 'young-adult', 'young', '*'],
 ['27', 'young-adult', 'young', '*'],
 ['43', 'adult', 'adult', '*'],
 ['52', 'adult', 'adult', '*'],
 ['47', 'adult', 'adult', '*'],
 ['30', 'adult', 'adult', '*'],
 ['36', 'adult', 'adult', '*'],
 ['32', 'adult', 'adult', '*']]
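
The lookup behind these rows can be sketched locally. The code below is an illustration, not PyARXaaS code: `check_contiguous` and `interval_row` are hypothetical helpers, and the sketch assumes that intervals include their lower bound and exclude their upper bound, which is what the output above suggests (30 maps to "adult", not "young-adult").

```python
# Intervals as (lower, upper, label), in the order they were added.
intervals = [
    (0, 18, "child"),
    (18, 30, "young-adult"),
    (30, 60, "adult"),
    (60, 120, "old"),
]
# Level-0 groups as (number of consecutive intervals, label).
groups = [(2, "young"), (2, "adult")]


def check_contiguous(intervals):
    """Raise if an interval does not start where the previous one ends."""
    for (_, upper, _), (lower, _, _) in zip(intervals, intervals[1:]):
        if upper != lower:
            raise ValueError(f"gap or overlap between {upper} and {lower}")


def interval_row(value, intervals, groups):
    """Build one row: value, interval label, group label, '*'."""
    for index, (lower, upper, label) in enumerate(intervals):
        if lower <= value < upper:  # assumed: lower inclusive, upper exclusive
            break
    else:
        raise ValueError(f"{value} is not covered by any interval")
    # Walk the groups until the one containing this interval is found.
    covered = 0
    for size, group_label in groups:
        covered += size
        if index < covered:
            return [str(value), label, group_label, "*"]
    raise ValueError("groups do not cover every interval")


check_contiguous(intervals)
for age in [29, 30, 52]:
    print(interval_row(age, intervals, groups))
```

The group label is found purely by position: the first group covers the first two intervals ("child" and "young-adult"), the second group covers the remaining two.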

Create Order based hierarchy

Order-based hierarchies are suited to categorical attributes such as country, education level and employment status.

1. Extract column to create hierarchy from

[16]:
diseases = data_df["disease"].tolist()

2. Strip to uniques

[17]:
unique_diseases = sorted(set(diseases))  # unique values in alphabetical order
unique_diseases
[17]:
['bronchitis',
 'flu',
 'gastric ulcer',
 'gastritis',
 'pneumonia',
 'stomach cancer']

3. Order column values

As this is a categorical attribute, ARXaaS has no way of knowing how to group the values other than by their order in the list.

[18]:
# Swap 'gastric ulcer' and 'pneumonia' so that the lung-related diseases come first
unique_diseases[2], unique_diseases[4] = unique_diseases[4], unique_diseases[2]
unique_diseases
[18]:
['bronchitis',
 'flu',
 'pneumonia',
 'gastritis',
 'gastric ulcer',
 'stomach cancer']

4. Create hierarchy builder to use

[19]:
order_based = OrderHierarchyBuilder()

5. Group the values

Note that the groups are applied to the values in the order in which they appear in the list. Labels are optional; if no label is set, the resulting field will be a concatenation of the values included in the group.

[20]:
order_based.level(0)\
    .add_group(3, "lung-related")\
    .add_group(3, "stomach-related")
[20]:
Level(level=0, groups={Group(grouping=3, label=lung-related): None, Group(grouping=3, label=stomach-related): None})

6. Call the ARXaaS service to create the hierarchy

[21]:
order_hierarchy = arxaas.hierarchy(order_based, unique_diseases)
[22]:
order_hierarchy
[22]:
[['bronchitis', 'lung-related', '*'],
 ['flu', 'lung-related', '*'],
 ['pneumonia', 'lung-related', '*'],
 ['gastritis', 'stomach-related', '*'],
 ['gastric ulcer', 'stomach-related', '*'],
 ['stomach cancer', 'stomach-related', '*']]
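
Because grouping depends only on position, the order of the list decides which values end up together. The sketch below illustrates this locally. `order_rows` is a hypothetical helper, not part of PyARXaaS, and the fallback label for unlabeled groups is an assumption; the exact concatenation format used by ARXaaS may differ.

```python
def order_rows(ordered_values, groups):
    """Assign values to groups by position; groups is a list of (size, label)."""
    rows = []
    start = 0
    for size, label in groups:
        members = ordered_values[start:start + size]
        if label is None:
            # Assumed fallback: join the members of the group.
            label = ", ".join(members)
        rows.extend([value, label, "*"] for value in members)
        start += size
    return rows


groups = [(3, "lung-related"), (3, "stomach-related")]
ordered = ["bronchitis", "flu", "pneumonia",
           "gastritis", "gastric ulcer", "stomach cancer"]

for row in order_rows(ordered, groups):
    print(row)

# The same groups applied to the alphabetical list put
# 'gastric ulcer' into the first group.
print(order_rows(sorted(ordered), groups)[2])
```

With the reordered list the sketch reproduces the hierarchy above. With the alphabetical list, "gastric ulcer" lands in "lung-related", which is why the values are reordered before grouping.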

Create Date based hierarchy

Date-based hierarchies are used for date values whose format can be described with a Java SimpleDateFormat pattern.

[26]:
dates = ["2020-07-16 15:28:024",
         "2019-07-16 16:38:025",
         "2019-07-16 17:48:025",
         "2019-07-16 18:48:025",
         "2019-06-16 19:48:025",
         "2019-06-16 20:48:025"]

1. Create the builder

The first parameter to the builder is the date_format. The date format specifies how ARXaaS should handle and parse the date strings. The format should follow Java SimpleDateFormat formatting: https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html

[31]:
date_based = DateHierarchyBuilder("yyyy-MM-dd HH:mm:SSS",
                          DateHierarchyBuilder.Granularity.SECOND_MINUTE_HOUR_DAY_MONTH_YEAR,
                          DateHierarchyBuilder.Granularity.MINUTE_HOUR_DAY_MONTH_YEAR,
                          DateHierarchyBuilder.Granularity.YEAR)
[32]:
date_hierarchy = arxaas.hierarchy(date_based , dates)
[33]:
date_hierarchy
[33]:
[['2020-07-16 15:28:024', '16.07.2020-15:28:00', '16.07.2020-15:28', '2020'],
 ['2019-07-16 16:38:025', '16.07.2019-16:38:00', '16.07.2019-16:38', '2019'],
 ['2019-07-16 17:48:025', '16.07.2019-17:48:00', '16.07.2019-17:48', '2019'],
 ['2019-07-16 18:48:025', '16.07.2019-18:48:00', '16.07.2019-18:48', '2019'],
 ['2019-06-16 19:48:025', '16.06.2019-19:48:00', '16.06.2019-19:48', '2019'],
 ['2019-06-16 20:48:025', '16.06.2019-20:48:00', '16.06.2019-20:48', '2019']]
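
In SimpleDateFormat patterns, lowercase `s` stands for seconds and uppercase `S` for milliseconds, so the pattern used above reads the last field of each date as milliseconds. That probably explains why the seconds appear as 00 in the output. The sketch below mimics the three levels with Python's `datetime`. `date_row` is a hypothetical helper, and the output formats are copied from the result above rather than from any ARXaaS documentation.

```python
from datetime import datetime


def date_row(text):
    """Parse one date string and return it at coarser and coarser granularity."""
    # %f reads the trailing digits as a fraction of a second,
    # so the seconds field of the parsed value stays 0.
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%f")
    return [
        text,
        parsed.strftime("%d.%m.%Y-%H:%M:%S"),  # second granularity
        parsed.strftime("%d.%m.%Y-%H:%M"),     # minute granularity
        parsed.strftime("%Y"),                 # year only
    ]


for date in ["2020-07-16 15:28:024", "2019-06-16 20:48:025"]:
    print(date_row(date))
```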

Example anonymization

[23]:
dataset = Dataset.from_pandas(data_df)
[24]:
dataset.set_attribute_type(AttributeType.IDENTIFYING, "salary")
[25]:
dataset.describe()
data:
  headers:
    ['zipcode', 'age', 'salary', 'disease']
rows:
    [47677, 29, 3, 'gastric ulcer']
    [47602, 22, 4, 'gastritis']
    [47678, 27, 5, 'stomach cancer']
    [47905, 43, 6, 'gastritis']
    [47909, 52, 11, 'flu']
    ...
attributes:
  field_name=zipcode, type=QUASIIDENTIFYING, hierarchy=None
  field_name=age, type=QUASIIDENTIFYING, hierarchy=None
  field_name=salary, type=IDENTIFYING, hierarchy=None
  field_name=disease, type=QUASIIDENTIFYING, hierarchy=None

[26]:
dataset.set_hierarchy("age", interval_hierarchy)
[27]:
dataset.set_hierarchy("zipcode", redaction_hierarchy)
[28]:
dataset.set_hierarchy("disease", order_hierarchy)
[29]:
anon_result = arxaas.anonymize(dataset=dataset, privacy_models=[KAnonymity(2)])
[30]:
anon_result.dataset.to_dataframe()
[30]:
   zipcode          age  salary          disease
0    47***  young-adult       *  stomach-related
1    47***  young-adult       *  stomach-related
2    47***  young-adult       *  stomach-related
3    47***        adult       *  stomach-related
4    47***        adult       *     lung-related
5    47***        adult       *     lung-related
6    47***        adult       *     lung-related
7    47***        adult       *     lung-related
8    47***        adult       *  stomach-related
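
k-anonymity with k=2 requires every combination of quasi-identifying values to occur in at least two rows. As a quick sanity check, the result can be verified by hand. The sketch below uses plain Python so that it runs without the service. The rows are copied from the output above, and `smallest_class` is a hypothetical helper.

```python
from collections import Counter

# Quasi-identifying columns (zipcode, age, disease) of the anonymized rows.
rows = [
    ("47***", "young-adult", "stomach-related"),
    ("47***", "young-adult", "stomach-related"),
    ("47***", "young-adult", "stomach-related"),
    ("47***", "adult", "stomach-related"),
    ("47***", "adult", "lung-related"),
    ("47***", "adult", "lung-related"),
    ("47***", "adult", "lung-related"),
    ("47***", "adult", "lung-related"),
    ("47***", "adult", "stomach-related"),
]


def smallest_class(rows):
    """Return the size of the smallest group of identical rows."""
    return min(Counter(rows).values())


print(smallest_class(rows))  # prints 2, so the result is 2-anonymous
```

The three groups have 3, 2 and 4 rows, so the smallest group meets the requested k of 2.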