Hierarchy generation using PyARXaaS¶

[1]:

from pyarxaas import ARXaaS
from pyarxaas.privacy_models import KAnonymity, LDiversityDistinct
from pyarxaas import AttributeType
from pyarxaas import Dataset
from pyarxaas.hierarchy import IntervalHierarchyBuilder, RedactionHierarchyBuilder, OrderHierarchyBuilder, DateHierarchyBuilder
import pandas as pd

Create connection to ARXaaS¶

[2]:

arxaas = ARXaaS("http://localhost:8080")

Fetch data¶

[3]:

data_df = pd.read_csv("../data/data2.csv", sep=";")

[4]:

data_df

[4]:

	zipcode	age	salary	disease
0	47677	29	3	gastric ulcer
1	47602	22	4	gastritis
2	47678	27	5	stomach cancer
3	47905	43	6	gastritis
4	47909	52	11	flu
5	47906	47	8	bronchitis
6	47605	30	7	bronchitis
7	47673	36	9	pneumonia
8	47607	32	10	stomach cancer

Create Redaction based hierarchy¶

Redaction-based hierarchies are best suited to categorical values that are written as digits. Attributes such as zipcodes are a prime candidate. The strategy is to redact one character at a time from the attribute value, so that each level of the hierarchy is more general than the one before it, until the privacy model criteria are met. The hierarchy builder can be configured to redact from either direction, and defaults to RIGHT_TO_LEFT. Redaction hierarchies take the least effort to create.

1. Extract column to create hierarchy from¶

[5]:

zipcodes = data_df["zipcode"].tolist()
zipcodes

[5]:

[47677, 47602, 47678, 47905, 47909, 47906, 47605, 47673, 47607]

2. Create hierarchy builder to use¶

Here we are specifying a character to use and the order the redaction should follow.

[6]:

redaction_based = RedactionHierarchyBuilder(redaction_char="♥",
                                            redaction_order=RedactionHierarchyBuilder.Order.LEFT_TO_RIGHT)

3. Call the ARXaaS service to create the hierarchy¶

[7]:

redaction_hierarchy = arxaas.hierarchy(redaction_based, zipcodes)

[8]:

redaction_hierarchy

[8]:

[['47677', '♥7677', '♥♥677', '♥♥♥77', '♥♥♥♥7', '♥♥♥♥♥'],
 ['47602', '♥7602', '♥♥602', '♥♥♥02', '♥♥♥♥2', '♥♥♥♥♥'],
 ['47678', '♥7678', '♥♥678', '♥♥♥78', '♥♥♥♥8', '♥♥♥♥♥'],
 ['47905', '♥7905', '♥♥905', '♥♥♥05', '♥♥♥♥5', '♥♥♥♥♥'],
 ['47909', '♥7909', '♥♥909', '♥♥♥09', '♥♥♥♥9', '♥♥♥♥♥'],
 ['47906', '♥7906', '♥♥906', '♥♥♥06', '♥♥♥♥6', '♥♥♥♥♥'],
 ['47605', '♥7605', '♥♥605', '♥♥♥05', '♥♥♥♥5', '♥♥♥♥♥'],
 ['47673', '♥7673', '♥♥673', '♥♥♥73', '♥♥♥♥3', '♥♥♥♥♥'],
 ['47607', '♥7607', '♥♥607', '♥♥♥07', '♥♥♥♥7', '♥♥♥♥♥']]

Redaction hierarchy without configuration¶

[9]:

no_config_redaction_based = RedactionHierarchyBuilder() # Create builder
redaction_hierarchy = arxaas.hierarchy(no_config_redaction_based, zipcodes) # pass builder and column to arxaas
redaction_hierarchy

[9]:

[['47677', '4767*', '476**', '47***', '4****', '*****'],
 ['47602', '4760*', '476**', '47***', '4****', '*****'],
 ['47678', '4767*', '476**', '47***', '4****', '*****'],
 ['47905', '4790*', '479**', '47***', '4****', '*****'],
 ['47909', '4790*', '479**', '47***', '4****', '*****'],
 ['47906', '4790*', '479**', '47***', '4****', '*****'],
 ['47605', '4760*', '476**', '47***', '4****', '*****'],
 ['47673', '4767*', '476**', '47***', '4****', '*****'],
 ['47607', '4760*', '476**', '47***', '4****', '*****']]
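
Both outputs above can be reproduced locally in a few lines of plain Python. The sketch below only illustrates the redaction idea and is not the ARXaaS implementation. The function name `redact_levels` is hypothetical, and the sketch assumes every value in the column has the same length.

```python
def redact_levels(value, redaction_char="*", left_to_right=False):
    """Return the generalization levels for one value, least to most redacted.

    Illustrative only: assumes all values in the column share one length.
    """
    text = str(value)
    length = len(text)
    levels = []
    for redacted in range(length + 1):
        if left_to_right:
            # Replace the first `redacted` characters.
            levels.append(redaction_char * redacted + text[redacted:])
        else:
            # Replace the last `redacted` characters (the default direction).
            levels.append(text[:length - redacted] + redaction_char * redacted)
    return levels


default_row = redact_levels(47677)
configured_row = redact_levels(47677, redaction_char="♥", left_to_right=True)
print(default_row)
print(configured_row)
```

Each call returns one row of the hierarchy: the original value followed by one level per redacted character.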

Create interval based hierarchy¶

Interval-based hierarchies are well suited to continuous numeric values. Attributes such as age, income or credit score are typically generalized with an interval hierarchy. The interval hierarchy builder requires the user to specify the intervals into which the attribute values are generalized. Optionally, these intervals can be labeled. In addition, intervals can be grouped upwards using levels and groups to create a deeper hierarchy.

1. Extract column to create hierarchy from¶

[10]:

column = data_df["age"].tolist()
column

[10]:

[29, 22, 27, 43, 52, 47, 30, 36, 32]

2. Create hierarchy builder to use¶

[11]:

interval_based = IntervalHierarchyBuilder()

3. Add intervals to the builder. The intervals must be continuous (without gaps)¶

[12]:

interval_based.add_interval(0,18, "child")
interval_based.add_interval(18,30, "young-adult")
interval_based.add_interval(30,60, "adult")
interval_based.add_interval(60,120, "old")

4. (Optionally) Add groupings. Groupings are added to a specific level and are applied in the order in which the intervals were added¶

[13]:

interval_based.level(0)\
    .add_group(2, "young")\
    .add_group(2, "adult");

5. Call the ARXaaS service to create the hierarchy¶

[14]:

interval_hierarchy = arxaas.hierarchy(interval_based, column)

[15]:

interval_hierarchy

[15]:

[['29', 'young-adult', 'young', '*'],
 ['22', 'young-adult', 'young', '*'],
 ['27', 'young-adult', 'young', '*'],
 ['43', 'adult', 'adult', '*'],
 ['52', 'adult', 'adult', '*'],
 ['47', 'adult', 'adult', '*'],
 ['30', 'adult', 'adult', '*'],
 ['36', 'adult', 'adult', '*'],
 ['32', 'adult', 'adult', '*']]
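
To make the mapping from value to interval to group concrete, here is a plain-Python sketch that reproduces the rows above. It is not how ARXaaS computes the hierarchy. The helper name `interval_row` is hypothetical, and the sketch assumes each interval includes its lower bound and excludes its upper bound, which is consistent with 30 being labeled "adult" in the output.

```python
# Intervals and groups mirror the builder calls in this section.
intervals = [(0, 18, "child"), (18, 30, "young-adult"),
             (30, 60, "adult"), (60, 120, "old")]
groups = [(2, "young"), (2, "adult")]  # (number of intervals, label)


def interval_row(value):
    """Return [value, interval label, group label, '*'] for one number."""
    for index, (low, high, label) in enumerate(intervals):
        if low <= value < high:  # assumed: lower inclusive, upper exclusive
            break
    else:
        raise ValueError(f"{value} is not covered by any interval")

    # Groups consume intervals in the order the intervals were added.
    start = 0
    for size, group_label in groups:
        if index < start + size:
            return [str(value), label, group_label, "*"]
        start += size
    raise ValueError(f"interval {label!r} is not covered by any group")


ages = [29, 22, 27, 43, 52, 47, 30, 36, 32]
sketch_hierarchy = [interval_row(age) for age in ages]
print(sketch_hierarchy[0])
```

The first group of two covers "child" and "young-adult", and the second covers "adult" and "old", which is why the third column reads "young" or "adult".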

Create Order based hierarchy¶

Order-based hierarchies are suited to categorical attributes, such as country, education level and employment status.

1. Extract column to create hierarchy from¶

[16]:

diseases = data_df["disease"].tolist()

2. Strip to uniques¶

[17]:

unique_diseases = set(diseases)
unique_diseases = list(unique_diseases)
unique_diseases.sort()
unique_diseases

[17]:

['bronchitis',
 'flu',
 'gastric ulcer',
 'gastritis',
 'pneumonia',
 'stomach cancer']
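
The three statements above can also be written as a single expression. The snippet below is a self-contained sketch in which the `diseases` list is copied from the table shown earlier rather than read from the CSV file.

```python
# Values of the "disease" column, copied from the table above.
diseases = ["gastric ulcer", "gastritis", "stomach cancer", "gastritis", "flu",
            "bronchitis", "bronchitis", "pneumonia", "stomach cancer"]

# set() drops the duplicates, sorted() returns a new alphabetically sorted list.
unique_diseases = sorted(set(diseases))
print(unique_diseases)
```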

3. Order column values¶

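Because this is a categorical attribute, ARXaaS has no way of knowing how to group the values other than by their order in the list. Here two values are swapped so that the values that belong in the same group end up next to each other.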

[18]:

unique_diseases[2], unique_diseases[4] = unique_diseases[4], unique_diseases[2]
unique_diseases

[18]:

['bronchitis',
 'flu',
 'pneumonia',
 'gastritis',
 'gastric ulcer',
 'stomach cancer']

4. Create hierarchy builder to use¶

[19]:

order_based = OrderHierarchyBuilder()

5. Group the values¶

Note that the groups are applied to the values in the order in which they appear in the list. Adding labels is optional; if no label is set, the resulting field will be a concatenation of the values included in the group.

[20]:

order_based.level(0)\
    .add_group(3, "lung-related")\
    .add_group(3, "stomach-related")

[20]:

Level(level=0, groups={Group(grouping=3, label=lung-related): None, Group(grouping=3, label=stomach-related): None})

6. Call the ARXaaS service to create the hierarchy¶

[21]:

order_hierarchy = arxaas.hierarchy(order_based, unique_diseases)

[22]:

order_hierarchy

[22]:

[['bronchitis', 'lung-related', '*'],
 ['flu', 'lung-related', '*'],
 ['pneumonia', 'lung-related', '*'],
 ['gastritis', 'stomach-related', '*'],
 ['gastric ulcer', 'stomach-related', '*'],
 ['stomach cancer', 'stomach-related', '*']]
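
The grouping is purely positional, which the following plain-Python sketch makes explicit. It is an illustration rather than the ARXaaS implementation. The function name `order_rows` is hypothetical, and the sketch requires a label for every group because the exact format of unlabeled groups is not shown here.

```python
def order_rows(ordered_values, groups):
    """Assign consecutive runs of values to groups given as (size, label)."""
    rows = []
    position = 0
    for size, label in groups:
        # Each group takes the next `size` values in list order.
        for value in ordered_values[position:position + size]:
            rows.append([value, label, "*"])
        position += size
    if position != len(ordered_values):
        raise ValueError("group sizes must cover every value exactly once")
    return rows


ordered_diseases = ["bronchitis", "flu", "pneumonia",
                    "gastritis", "gastric ulcer", "stomach cancer"]
sketch_rows = order_rows(ordered_diseases,
                         [(3, "lung-related"), (3, "stomach-related")])
for row in sketch_rows:
    print(row)
```

Without reordering the list first, the sorted list would put "gastric ulcer" in the first group of three, which is why the reordering step matters.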

Create Date based hierarchy¶

Date-based hierarchies are used for date values whose format can be described with a Java SimpleDateFormat pattern.

[26]:

dates = ["2020-07-16 15:28:024",
         "2019-07-16 16:38:025",
         "2019-07-16 17:48:025",
         "2019-07-16 18:48:025",
         "2019-06-16 19:48:025",
         "2019-06-16 20:48:025"]

1. Create the builder¶

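The first parameter to the builder is the date_format, which specifies how ARXaaS should parse the date strings. The format should follow Java SimpleDateFormat formatting; see https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The remaining parameters are the granularities to generalize to, ordered from most to least specific. Note that in SimpleDateFormat, SSS denotes milliseconds (ss denotes seconds), so with this format the trailing three digits of each date string are read as milliseconds.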

[31]:

date_based = DateHierarchyBuilder("yyyy-MM-dd HH:mm:SSS",
                          DateHierarchyBuilder.Granularity.SECOND_MINUTE_HOUR_DAY_MONTH_YEAR,
                          DateHierarchyBuilder.Granularity.MINUTE_HOUR_DAY_MONTH_YEAR,
                          DateHierarchyBuilder.Granularity.YEAR)

[32]:

date_hierarchy = arxaas.hierarchy(date_based , dates)

[33]:

date_hierarchy

[33]:

[['2020-07-16 15:28:024', '16.07.2020-15:28:00', '16.07.2020-15:28', '2020'],
 ['2019-07-16 16:38:025', '16.07.2019-16:38:00', '16.07.2019-16:38', '2019'],
 ['2019-07-16 17:48:025', '16.07.2019-17:48:00', '16.07.2019-17:48', '2019'],
 ['2019-07-16 18:48:025', '16.07.2019-18:48:00', '16.07.2019-18:48', '2019'],
 ['2019-06-16 19:48:025', '16.06.2019-19:48:00', '16.06.2019-19:48', '2019'],
 ['2019-06-16 20:48:025', '16.06.2019-20:48:00', '16.06.2019-20:48', '2019']]
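
For readers who want to check the levels without a running service, here is a sketch using Python's `datetime` module. It is not the ARXaaS implementation. The output formats are assumptions read off the result above, the helper name `date_row` is hypothetical, and `%f` is used as the closest Python counterpart to the `SSS` field.

```python
from datetime import datetime

# Python counterpart of "yyyy-MM-dd HH:mm:SSS"; %f reads the trailing digits
# as a fraction of a second, so the seconds field itself stays at zero.
PARSE_FORMAT = "%Y-%m-%d %H:%M:%f"

# One output format per granularity, as they appear in the result above
# (assumed, not taken from the ARXaaS documentation).
LEVEL_FORMATS = ["%d.%m.%Y-%H:%M:%S",  # SECOND_MINUTE_HOUR_DAY_MONTH_YEAR
                 "%d.%m.%Y-%H:%M",     # MINUTE_HOUR_DAY_MONTH_YEAR
                 "%Y"]                 # YEAR


def date_row(date_string):
    """Return the original string followed by one value per granularity."""
    parsed = datetime.strptime(date_string, PARSE_FORMAT)
    return [date_string] + [parsed.strftime(fmt) for fmt in LEVEL_FORMATS]


first_row = date_row("2020-07-16 15:28:024")
print(first_row)
```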

Example anonymization¶

[23]:

dataset = Dataset.from_pandas(data_df)

[24]:

dataset.set_attribute_type(AttributeType.IDENTIFYING, "salary")

[25]:

dataset.describe()

data:
  headers:
    ['zipcode', 'age', 'salary', 'disease']
rows:
    [47677, 29, 3, 'gastric ulcer']
    [47602, 22, 4, 'gastritis']
    [47678, 27, 5, 'stomach cancer']
    [47905, 43, 6, 'gastritis']
    [47909, 52, 11, 'flu']
    ...
attributes:
  field_name=zipcode, type=QUASIIDENTIFYING, hierarchy=None
  field_name=age, type=QUASIIDENTIFYING, hierarchy=None
  field_name=salary, type=IDENTIFYING, hierarchy=None
  field_name=disease, type=QUASIIDENTIFYING, hierarchy=None

[26]:

dataset.set_hierarchy("age", interval_hierarchy)

[27]:

dataset.set_hierarchy("zipcode", redaction_hierarchy)

[28]:

dataset.set_hierarchy("disease", order_hierarchy)

[29]:

anon_result = arxaas.anonymize(dataset=dataset, privacy_models=[KAnonymity(2)])

[30]:

anon_result.dataset.to_dataframe()

[30]:

	zipcode	age	salary	disease
0	47***	young-adult	*	stomach-related
1	47***	young-adult	*	stomach-related
2	47***	young-adult	*	stomach-related
3	47***	adult	*	stomach-related
4	47***	adult	*	lung-related
5	47***	adult	*	lung-related
6	47***	adult	*	lung-related
7	47***	adult	*	lung-related
8	47***	adult	*	stomach-related
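
As a quick sanity check, the result can be tested for 2-anonymity by counting how many rows share each combination of quasi-identifying values. The sketch below uses only the standard library. The rows are copied from the table above, and salary is left out because it was marked as identifying and is fully suppressed.

```python
from collections import Counter

# (zipcode, age, disease) for each row of the anonymized table above.
anonymized_rows = [
    ("47***", "young-adult", "stomach-related"),
    ("47***", "young-adult", "stomach-related"),
    ("47***", "young-adult", "stomach-related"),
    ("47***", "adult", "stomach-related"),
    ("47***", "adult", "lung-related"),
    ("47***", "adult", "lung-related"),
    ("47***", "adult", "lung-related"),
    ("47***", "adult", "lung-related"),
    ("47***", "adult", "stomach-related"),
]

# Rows with identical quasi-identifier values form one equivalence class.
class_sizes = Counter(anonymized_rows)
smallest_class = min(class_sizes.values())

k = 2
print(f"smallest equivalence class: {smallest_class}")
print(f"satisfies {k}-anonymity: {smallest_class >= k}")
```

A dataset is k-anonymous when every equivalence class contains at least k rows, so the smallest class size is the only number that needs checking.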
