Closed
Description
I was trying to use Presidio to identify and remove IP addresses and ran into the following issue: '::' was recognized as a string containing an IP address, while '2345:0425:2CA1:0000:0000:0567:5673:23b5' was not recognized as an IP address. I ran a few tests:
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

# Two colons only
results = analyzer.analyze(text='::',
                           entities=['IP_ADDRESS'],
                           language='en')
print(results)

# Full (uncompressed) IPv6 address
results2 = analyzer.analyze(text='2345:0425:2CA1:0000:0000:0567:5673:23b5',
                            entities=['IP_ADDRESS'],
                            language='en')
print(results2)

# Same address with the zero groups compressed to '::'
results3 = analyzer.analyze(text='2345:0425:2CA1::0567:5673:23b5',
                            entities=['IP_ADDRESS'],
                            language='en')
print(results3)
Output:
[type: IP_ADDRESS, start: 0, end: 2, score: 0.6]
[]
[type: IP_ADDRESS, start: 13, end: 30, score: 0.6]
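To see what the third result actually covers, the input can be sliced by the reported offsets. This is only a small illustration using the `start` and `end` values from the output above; the variable names are made up for the example.

```python
# Third test input and the offsets reported for it in the output above.
text = '2345:0425:2CA1::0567:5673:23b5'
start, end = 13, 30

# The matched span is only the tail of the address, beginning just before '::'.
matched = text[start:end]
print(matched)
print(len(text), len(matched))
```

The slice covers only the part of the address from just before the double colon onward, not the whole 30-character string.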
This suggests the recognizer treats anything containing two consecutive colons as an IPv6 address. I then checked the source code and found this in the tests:
Can the IPv6 regex be fixed?
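As a point of comparison, here is a minimal sketch of how the same strings are classified by Python's standard `ipaddress` module. This is not Presidio's implementation, and `is_ipv6` is a hypothetical helper name used only for illustration. It checks whole strings rather than searching inside longer text, so it would only be useful as a validation step on a candidate match.

```python
import ipaddress


def is_ipv6(candidate: str) -> bool:
    """Return True only if the whole string parses as an IPv6 address.

    Hypothetical helper for illustration; not part of Presidio.
    """
    try:
        parsed = ipaddress.ip_address(candidate)
    except ValueError:
        return False
    return isinstance(parsed, ipaddress.IPv6Address)


samples = [
    '::',                                        # unspecified address
    '2345:0425:2CA1:0000:0000:0567:5673:23b5',   # full form
    '2345:0425:2CA1::0567:5673:23b5',            # compressed form
    '2345:0425',                                 # too few groups
]
for sample in samples:
    print(sample, is_ipv6(sample))
```

Note that '::' on its own does parse as a valid IPv6 address (the unspecified address), so by this check the bigger problems are that the full form is missed and that the compressed form is only partially matched.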