Commit 4abd1de

Adding output format param to save_annotated and adding new method save_text_extraction.
Updated Readme file.
1 parent 0e522c6 commit 4abd1de

3 files changed: 75 additions & 15 deletions

README.md

Lines changed: 38 additions & 6 deletions
@@ -163,17 +163,49 @@ for ticker in references.tickers:
     print(ticker)
 ```
 
-### Text Analytics
+## Text Analytics
 
 Analyse your own content using RavenPack’s proprietary NLP technology.
 
-The API for analyzing your internal content is still in beta and may change in the future. You can request an early
-access and [see an example of usage here](ravenpackapi/examples/text_extraction.py).
+The API for analyzing your internal content is still in beta and may change in the future. You can request early access and [see an example of usage here](ravenpackapi/examples/text_analytics_example.py).
+
+### Uploading a file
+Upload a file to the system. To have your files analyzed by RavenPack's text analytics platform, call the following method:
+
+```python
+f = api.upload.file("_orig.doc")
+```
+
+Different options and features are available when uploading a file for development. For more information, please check the user guide found on RavenPack's platform.
+
+### Getting analytics
+Save the analytics for a processed file. You can retrieve analytics in JSON-Lines or CSV format:
+
+```python
+f.save_analytics("_analytics.json")
+```
+
+### Getting normalized documents
+RavenPack’s Text Analytics provides normalized content in JSON format, along with text categorization, tables in HTML format and metadata derived from the original document.
+
+```python
+f.save_text_extraction("_text_extraction.json")
+```
+
+It is also possible to obtain the normalized content in JSON format, along with annotations of entities, events and analytics derived from the content.
+
+```python
+f.save_annotated("_annotated_document.json", output_format='application/json')
+```
+
+For further details, please [see the usage example here](ravenpackapi/examples/text_analytics_example.py).
+
+
+
 
 ### Accessing the low-level requests
 
-RavenPack API wrapper is using the [requests library](https://2.python-requests.org) to do HTTPS requests, you can set
-common requests parameters to all the outbound calls by setting the `common_request_params` attribute.
+RavenPack API wrapper is using the [requests library](https://2.python-requests.org) to do HTTPS requests, you can set common requests parameters to all the outbound calls by setting the `common_request_params` attribute.
 
 For example, to disable HTTPS certificate verification and to setup your internal proxy:
 
@@ -189,4 +221,4 @@ api.common_request_params.update(
 # use the api to do requests
 ```
 
-PS. For setting your internal proxies, requests will honor the HTTPS_PROXY environment variable.
+PS. For setting your internal proxies, requests will honor the HTTPS_PROXY environment variable.
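
The README text in this diff says analytics can be retrieved in JSON-Lines format. As a rough sketch of reading such a file back, assuming one JSON object per line: the helper name `read_json_lines` and the sample field names below are hypothetical and are not taken from RavenPack's schema.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Parse a JSON-Lines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, such as a trailing newline
                records.append(json.loads(line))
    return records


# Write a tiny stand-in for "_analytics.json"; the fields are made up.
sample_path = os.path.join(tempfile.mkdtemp(), "_analytics.json")
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write('{"entity": "A", "score": 1}\n{"entity": "B", "score": 2}\n')

records = read_json_lines(sample_path)
print(len(records))
```

In practice the path would be whatever file name was passed to `f.save_analytics(...)`.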

ravenpackapi/examples/text_extraction.py renamed to ravenpackapi/examples/text_analytics_example.py

Lines changed: 19 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -8,10 +8,15 @@
 print(f)
 
 # upload a file to access the analytics
-f = api.upload.file("_orig.doc",
-                    # upload_mode="RPXML"
-                    # properties={"primary_entity": "RavenPack"}
-                    )
+f = api.upload.file("_orig.doc")
+#f = api.upload.file("_orig.doc",
+#                    upload_mode="RPJSON",
+#                    properties={
+#                        "primary_entity": "RavenPack",
+#                        "provider_document_id": "<YOUR_DOCUMENT_ID>",
+#                        "extractor": "PDF_TABLE_EXTRACTOR"
+#                    }
+#)
 
 # you can also upload from a publicly available URL
 # f = api.upload.file("demo.html",
@@ -24,19 +29,26 @@
 # f = api.upload.get('XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
 
 # get back the analytics found in the document
-f.save_analytics("_analytics.json")
+# f.save_analytics("_analytics.csv", output_format='text/csv')
+f.save_analytics("_analytics.json", output_format='application/json')
 
 # the annotated version
-f.save_annotated("us30orig.xml")
+# f.save_annotated("_annotated_document.xml", output_format='application/xml')
+f.save_annotated("_annotated_document.json", output_format='application/json')
 
 # or the original
 f.save_original("_orig.doc")
 
-# show the extracted text
+# show or save the extracted text
 # extracted_text = f.text_extraction()
+f.save_text_extraction("_text_extraction.json", output_format='application/json')
 
 # given a file we can set tags
 # f.set_metadata(tags=['file tag'])
+# f.get_metadata()
+
+# return the process status of the file
+f.get_status()
 
 # ... or delete it
 # f.delete()
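
The example pairs each output file name with a matching `output_format` by hand. One way to keep the two in sync is to derive the MIME type from the file extension. The sketch below uses only the three MIME types that appear in this diff. The helper `output_format_for` and the mapping are hypothetical and not part of `ravenpackapi`, and whether a given endpoint accepts every one of these formats is an assumption to verify.

```python
import os

# MIME types seen in the example file; the mapping itself is an assumption.
FORMAT_BY_EXTENSION = {
    ".json": "application/json",
    ".csv": "text/csv",
    ".xml": "application/xml",
}


def output_format_for(filename):
    """Return the MIME type matching the filename's extension."""
    extension = os.path.splitext(filename)[1].lower()
    try:
        return FORMAT_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError("no known output format for %r" % filename)


# Hypothetical usage with an uploaded file object `f`:
#   name = "_analytics.csv"
#   f.save_analytics(name, output_format=output_format_for(name))
print(output_format_for("_annotated_document.xml"))
```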

ravenpackapi/upload/models.py

Lines changed: 18 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -102,11 +102,15 @@ def get_analytics(self, output_format='application/json'):
         return response.text
 
     @api_method
-    def save_annotated(self, filename):
+    def save_annotated(self, filename, output_format='application/xml'):
         self.wait_for_completion()
         response = retry_on_too_early(self.api.request,
                                       '%s/files/%s/annotated' % (self.api._UPLOAD_BASE_URL, self.file_id),
-                                      stream=True)
+                                      stream=True,
+                                      headers=dict(
+                                          Accept=output_format,
+                                          **self.api.headers
+                                      ))
         with open(filename, 'wb') as f:
             for chunk in response.iter_content(chunk_size=self.api._CHUNK_SIZE):
                 f.write(chunk)
@@ -165,6 +169,18 @@ def text_extraction(self, output_format="text/csv"):
         )
         return response.text
 
+    @api_method
+    def save_text_extraction(self, filename, output_format='application/json'):
+        headers = self.api.headers.copy()
+        headers["Content-type"] = output_format
+        response = retry_on_too_early(self.api.request,
+                                      '%s/files/%s/text-extraction' % (self.api._UPLOAD_BASE_URL, self.file_id),
+                                      stream=True,
+                                      headers=headers)
+        with open(filename, 'wb') as f:
+            for chunk in response.iter_content(chunk_size=self.api._CHUNK_SIZE):
+                f.write(chunk)
+
 
 class Folder(object):
     """ A Folder containing files """
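
Both save methods follow the same pattern: build request headers that name the desired format, then stream the response body to disk in chunks. The sketch below isolates that pattern so it can run offline. `build_headers`, `save_stream` and `FakeResponse` are hypothetical names, not part of `ravenpackapi`, and the header values are placeholders. It copies the base headers before setting `Accept`, because `dict(Accept=..., **base)` raises a `TypeError` if `base` already contains an `Accept` key. Using `Accept` here is also an assumption: it is the conventional header for requesting a response format, whereas `save_text_extraction` in this diff sets `Content-type`.

```python
import os
import tempfile


def build_headers(base_headers, output_format):
    """Copy the base headers and set Accept to the requested format."""
    headers = dict(base_headers)  # copy so the shared headers are not mutated
    headers["Accept"] = output_format
    return headers


def save_stream(response, filename, chunk_size=8192):
    """Write a streamed response body to disk chunk by chunk."""
    with open(filename, "wb") as handle:
        for chunk in response.iter_content(chunk_size=chunk_size):
            handle.write(chunk)


class FakeResponse(object):
    """Stand-in for requests.Response so the sketch runs without a network."""

    def __init__(self, body):
        self._body = body

    def iter_content(self, chunk_size=1):
        for start in range(0, len(self._body), chunk_size):
            yield self._body[start:start + chunk_size]


base = {"User-Agent": "example", "Accept": "*/*"}  # placeholder headers
headers = build_headers(base, "application/json")

target = os.path.join(tempfile.mkdtemp(), "_text_extraction.json")
save_stream(FakeResponse(b'{"ok": true}'), target, chunk_size=4)
with open(target, "rb") as handle:
    saved = handle.read()
```

With a real `requests` response obtained with `stream=True`, `save_stream` would be called the same way, since `Response.iter_content` accepts a `chunk_size` argument.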
