Python RegEx for NLP
    01/08/22 - #regex #nlp #python

RegEx for NLP

Natural Language Processing requires extensive manipulation of the text strings you intend to process. Before a machine can work with human language, the text has to be divided into its component elements and grouped; regular expressions (RegEx) are well suited to this. This article, which I will continue to update over time, lists some of the most useful RegEx patterns for NLP, with sample code attached.

Remove Digits

How to remove numbers with regex:

import re

# RegEx - Remove digits
pattern = r"\d+"
test_phrase = r"I w4nt 4 sm4rtph0n3. Can you help me? Buy it on https://amazon.com"

result = re.sub(pattern, '', test_phrase)
print(result)
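
Note that `\d+` also strips digits embedded in words, so `w4nt` becomes `wnt`. If only standalone numbers should be removed, a word-boundary variant may be a better fit. The following is a minimal sketch under that assumption; the sample phrase and the whitespace clean-up step are illustrative additions, not part of the recipe above.

```python
import re

# Assumption: only whole-number tokens should be removed;
# digits inside words such as "w4nt" are left untouched.
standalone_digits = re.compile(r"\b\d+\b")
test_phrase = "I w4nt 4 sm4rtph0n3 with 128 GB"

result = standalone_digits.sub("", test_phrase)
# Collapse the double spaces left behind by the removed tokens
result = re.sub(r"\s{2,}", " ", result).strip()
print(result)
```

Here `4` and `128` are dropped, while `w4nt` and `sm4rtph0n3` survive because their digits are not surrounded by word boundaries.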
                    

Remove Symbols

How to remove special characters and symbols with regex:

import re

# RegEx - Remove special characters
pattern = r"[^a-zA-Z0-9]+"
test_phrase = r"I want a smartphone. Can you help me? Buy it on https://amazon.com"

result = re.sub(pattern, ' ', test_phrase)
print(result)
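
The character class `[^a-zA-Z0-9]` treats accented letters as symbols, which can damage non-English text. A possible alternative, sketched here with a hypothetical sample phrase, relies on `\w` being Unicode-aware for `str` patterns in Python 3. The normalization step is an extra precaution and an assumption about the input, not something the recipe above requires.

```python
import re
import unicodedata

# Hypothetical sample text with accented letters (not from the article)
test_phrase = "Café résumé: naïve? Yes!"

# Normalize to NFC so each accented letter is a single code point that \w matches
normalized = unicodedata.normalize("NFC", test_phrase)

# Replace anything that is neither a word character nor whitespace.
# Note: \w also matches "_", so underscores are kept by this pattern.
result = re.sub(r"[^\w\s]+", " ", normalized)
# Collapse repeated whitespace and trim the ends
result = " ".join(result.split())
print(result)
```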
                    

Remove Hyperlinks

How to remove Hyperlinks with RegEx:

import re

# RegEx - Remove hyperlinks
pattern = r"http\S+"
test_phrase = r"I want a smartphone. Can you help me? Buy it on https://amazon.com"

result = re.sub(pattern, '', test_phrase)
print(result)
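
The pattern `http\S+` misses links written without a scheme, such as `www.example.com`. A slightly broader variant is sketched below; the choice of prefixes and the `example.com` link are assumptions made for illustration.

```python
import re

# Assumption: links start with http://, https:// or a bare "www."
pattern = r"(?:https?://|www\.)\S+"
test_phrase = "Buy it on https://amazon.com or www.example.com today"

result = re.sub(pattern, "", test_phrase)
# Collapse the double spaces left where the links used to be
result = re.sub(r"\s{2,}", " ", result).strip()
print(result)
```

Because `\S+` runs to the next whitespace, punctuation glued to the end of a link (for example a closing period) is removed along with it.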
                    

Split Phrases

How to split phrases with RegEx:

import re

# RegEx - Split phrases
pattern = r"[.!?]"
test_phrase = r"I want a smartphone. Can you help me? Thanks!"

result = re.split(pattern, test_phrase)
print(result)
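
Splitting on the punctuation itself drops the marks, leaves a leading space on each later sentence, and produces an empty final item. If the punctuation should be kept, one option is to split on the whitespace that follows it, using a lookbehind. This is a minimal sketch, assuming sentences are separated by whitespace after `.`, `!` or `?`.

```python
import re

# Split on whitespace that is preceded by sentence-ending punctuation.
# The lookbehind (?<=...) checks the punctuation without consuming it.
pattern = r"(?<=[.!?])\s+"
test_phrase = "I want a smartphone. Can you help me? Thanks!"

sentences = re.split(pattern, test_phrase)
print(sentences)
```

This approach still splits after abbreviations such as "e.g." or "Dr.", so real text may need a dedicated sentence tokenizer.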
                    

Extract prices

How to extract a price with RegEx:

import re

# RegEx - Extract a price
pattern = r"(?:¥|£|€|\$)\s*([\d,]+(?:\.\d+)?)|([\d,]+(?:\.\d+)?)\s*(?:¥|£|€|\$)"
test_phrase = r"I want a smartphone that costs less than €300"

result = re.search(pattern, test_phrase)
print(result.group())
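
`re.search` returns only the first match, and `group()` includes the currency symbol. To collect every price as a number, the same pattern can be used with `re.findall`, which returns one tuple of capture groups per match. The sketch below uses a hypothetical sample phrase and assumes commas are thousands separators.

```python
import re

# Same pattern as above: group 1 = amount after a symbol, group 2 = amount before one
pattern = r"(?:¥|£|€|\$)\s*([\d,]+(?:\.\d+)?)|([\d,]+(?:\.\d+)?)\s*(?:¥|£|€|\$)"
# Hypothetical sample phrase (not from the article)
test_phrase = "Phones: €300 for the basic one and $1,299.99 for the flagship or 450 £ for the mid-range"

amounts = []
for after_symbol, before_symbol in re.findall(pattern, test_phrase):
    raw = after_symbol or before_symbol  # exactly one of the two groups is non-empty
    # Assumption: commas are thousands separators, so they are dropped before conversion
    amounts.append(float(raw.replace(",", "")))

print(amounts)
```

Since `[\d,]+` also accepts commas, a comma that directly follows or precedes a price can be captured as part of the amount; text with that punctuation may need a stricter pattern.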
                    

Remove HTML tags

How to remove HTML tags with RegEx:

import re

# RegEx - Remove HTML tags
pattern = r"<.*?>"
test_phrase = r"<div>I want a smartphone that costs less than <b>€300</b></div>"

# re.sub returns a plain string, so it is printed directly
result = re.sub(pattern, '', test_phrase)
print(result)
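
Stripping tags leaves HTML entities such as `&euro;` or `&amp;` in the text. One way to handle them, sketched here as an optional extra step, is the standard library's `html.unescape`. The sample markup is a hypothetical variation of the phrase used in this article.

```python
import html
import re

pattern = r"<.*?>"
# Hypothetical sample: same sentence, with the euro sign written as an entity
test_phrase = "<div>I want a smartphone that costs less than <b>&euro;300</b></div>"

# First strip the tags, then decode entities into their characters
text = re.sub(pattern, "", test_phrase)
text = html.unescape(text)
print(text)
```

A regular expression like this is only suitable for simple, well-formed snippets; for full pages, an HTML parser is usually more reliable.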