This project shares data collected from 300 Android applications. The corpus is divided into two parts:
- Corpus 1 and Corpus 2: Distilled data (format described below)
- Raw APK data
The distilled data contains the following directories:
- Text description files: The full textual description of each app
- Lemmatized text files: The text description files after lemmatization has been applied
- Uses permission files: The permissions actually requested by the app, as identified from its Android.xml file
- Highest frequency words and frequency by permission: The word frequency counts for each application
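As a rough illustration of what the permission and word-frequency files contain, the Python sketch below extracts the requested permissions from a decoded manifest and counts the most frequent words in a description. The function names, the sample inputs, and the tokenization rule are assumptions made for this example; they are not the scripts used to build the corpus, and the actual file layout may differ.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List, Tuple

# Standard XML namespace used for "android:" attributes in app manifests.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def requested_permissions(manifest_xml: str) -> List[str]:
    """Return the names declared in <uses-permission> elements.

    Assumes the manifest has already been decoded to plain-text XML.
    """
    root = ET.fromstring(manifest_xml)
    names = []
    for elem in root.iter("uses-permission"):
        name = elem.get("{%s}name" % ANDROID_NS)
        if name:
            names.append(name)
    return names


def top_words(text: str, n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent lowercase alphabetic tokens in text.

    The tokenization rule (runs of a-z) is an assumption for illustration.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common(n)


if __name__ == "__main__":
    # Hypothetical sample inputs, not taken from the corpus.
    sample_manifest = (
        '<manifest xmlns:android="http://schemas.android.com/apk/res/android">'
        '<uses-permission android:name="android.permission.CAMERA"/>'
        "</manifest>"
    )
    print(requested_permissions(sample_manifest))
    print(top_words("scan photo scan map scan photo", 2))
```

To apply this to the corpus, the same functions could be run over each app's lemmatized text file and permission file, once the actual directory names are known.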
The raw APK data is also made available for the keen user at (https://ibm.box.com/s/n3fre1ltdsb5hdievvyiq7f1urdw762q).
The apps were selected so that the corpus covers both well-known and "dubious" apps.
Cite as: A. Palit, M. Srivatsa, R. Ganti, and C. Simpkin (2017). 'Identifying Sensor Accesses from Service Descriptions', in Proc. of the PADG workshop, co-located with IEEE BigData, 2017.