정규표현식 톺아보기 with JavaScript and Python

Python에서의 정규표현식 활용

1. Python의 정규표현식 기초

1.1 re 모듈

python의 정규표현식은 re 모듈을 사용하여 처리할 수 있습니다. re 모듈을 사용하는 방법으로는 크게 2가지로 나뉩니다. 첫 번째는 re.compile() 메서드를 사용하여 패턴을 미리 컴파일하고, 두 번째는 re 모듈의 메서드를 직접 사용하는 방법입니다.

1.1.1 기본 import와 사용법

import re
 
# 1. 기본 패턴 매칭
pattern = re.compile(r'hello')
print(pattern.match('hello world'))
 
# 2. re 모듈의 메서드 사용
print(re.match(r'hello', 'hello world'))

import re
 
# 1. 기본 패턴 매칭
pattern = re.compile(r'hello')
print(pattern.match('hello world'))
 
# 2. re 모듈의 메서드 사용
print(re.match(r'hello', 'hello world'))

1.1.2 raw 문자열 사용

python에서 raw문자열을 사용하면 개행등의 특수문자를 이스케이프하지 않고 그대로 사용할 수 있습니다. 아래와 같이 한 번 실행해보세요.

print('hello\nworld')
print(r'hello\nworld')

print('hello\nworld')
print(r'hello\nworld')

아래와 같이 raw 문자열을 사용하면 정규표현식을 더 깔끔하게 작성할 수 있습니다.

# raw 문자열을 사용하지 않은 경우
pattern1 = '\\d+'  # 백슬래시를 이스케이프하기 위해 두 번 써야 함
 
# raw 문자열을 사용한 경우
pattern2 = r'\d+'  # 더 깔끔하고 실수를 줄일 수 있음

# raw 문자열을 사용하지 않은 경우
pattern1 = '\\d+'  # 백슬래시를 이스케이프하기 위해 두 번 써야 함
 
# raw 문자열을 사용한 경우
pattern2 = r'\d+'  # 더 깔끔하고 실수를 줄일 수 있음

1.2 컴파일 플래그

1.2.1 주요 플래그

re.IGNORECASE  # 또는 re.I: 대소문자 구분 없음
re.MULTILINE   # 또는 re.M: 다중 행 모드
re.DOTALL      # 또는 re.S: .이 개행 문자도 매칭
re.VERBOSE     # 또는 re.X: 상세 모드 (공백과 주석 허용)

re.IGNORECASE  # 또는 re.I: 대소문자 구분 없음
re.MULTILINE   # 또는 re.M: 다중 행 모드
re.DOTALL      # 또는 re.S: .이 개행 문자도 매칭
re.VERBOSE     # 또는 re.X: 상세 모드 (공백과 주석 허용)

1.2.2 플래그 조합

플래그는 단독으로 사용할 수도 있고, 여러 개를 조합하여 사용할 수도 있습니다. 플래그는 re.MULTILINE | re.DOTALL와 같이 | 연산자를 사용하여 조합할 수 있습니다.

import re
 
# 플래그를 사용하지 않은 경우
text = """Hello World
hello python
HELLO REGEX"""
 
pattern = re.compile(r'hello')
findall_result = pattern.findall(text)
print(findall_result)  # ['hello']
 
# 플래그를 사용한 경우
pattern = re.compile(r'hello', re.IGNORECASE) 
findall_result = pattern.findall(text)
print(findall_result)  # ['Hello', 'hello', 'HELLO']

import re
 
# 플래그를 사용하지 않은 경우
text = """Hello World
hello python
HELLO REGEX"""
 
pattern = re.compile(r'hello')
findall_result = pattern.findall(text)
print(findall_result)  # ['hello']
 
# 플래그를 사용한 경우
pattern = re.compile(r'hello', re.IGNORECASE) 
findall_result = pattern.findall(text)
print(findall_result)  # ['Hello', 'hello', 'HELLO']

2. 기본 메서드

2.1 검색 메서드

검색 메서드는 match, search, findall, finditer 메서드가 있습니다. match는 문자열의 처음부터 정규식과 매치되는지 조사합니다. 다만 처음부터 매칭이 안되는 경우 None을 반활 할 수 있기 때문에 문자열 전체를 검색해야 할 때에는 search를 사용합니다.findall은 정규식과 매치되는 모든 문자열을 리스트로 반환하고, finditer는 정규식과 매치되는 모든 문자열을 반복 가능한 객체로 반환합니다.

2.1.1 match와 search

text = "welcome hello world"
pattern = re.compile(r'hello')
 
# match()는 문자열 시작부터 검색
print(pattern.match(text))  # None
 
# search()는 문자열 전체에서 검색
print(pattern.search(text))  # <re.Match object>

text = "welcome hello world"
pattern = re.compile(r'hello')
 
# match()는 문자열 시작부터 검색
print(pattern.match(text))  # None
 
# search()는 문자열 전체에서 검색
print(pattern.search(text))  # <re.Match object>

2.1.2 finditer와 findall

text = "hello world hello"
pattern = re.compile(r'\w+')
 
# findall()은 모든 매칭을 리스트로 반환
print(pattern.findall(text))  # ['hello', 'world', 'hello']
 
# finditer()는 매칭을 반복 가능한 객체로 반환
for m in pattern.finditer(text):
    print(m)  # <re.Match object>

text = "hello world hello"
pattern = re.compile(r'\w+')
 
# findall()은 모든 매칭을 리스트로 반환
print(pattern.findall(text))  # ['hello', 'world', 'hello']
 
# finditer()는 매칭을 반복 가능한 객체로 반환
for m in pattern.finditer(text):
    print(m)  # <re.Match object>

2.2 문자열 처리 메서드

sub와 split 메서드는 정규식을 사용하여 문자열을 처리할 때 유용합니다. sub 메서드는 정규식과 매칭되는 문자열을 다른 문자열로 대체하고, split 메서드는 정규식과 매칭되는 문자열을 기준으로 문자열을 나눕니다. sub는 string에 replace와 유사하지만 replace는 정규식을 사용할 수 없습니다.

2.2.1 sub 메서드

text = "hello world"
pattern = re.compile(r'o')
result = pattern.sub('0', text)  # 'hell0 w0rld'

text = "hello world"
pattern = re.compile(r'o')
result = pattern.sub('0', text)  # 'hell0 w0rld'

sub를 사용하여 그룹으로 묶은 패턴을 사용할 수도 있습니다. 아래와 같이 사용할 수 있습니다.

text = "hello world"
pattern = re.compile(r'(\w+)\s(\w+)')
print(pattern.sub(r'\2 \1 \2', text))  # 'world hello world'

text = "hello world"
pattern = re.compile(r'(\w+)\s(\w+)')
print(pattern.sub(r'\2 \1 \2', text))  # 'world hello world'

이러한 방식으로 좀 더 고급 패턴을 사용할 수 있습니다.

text = "064.1000!20000"
pattern = re.compile(r'(\d{3})[.!](\d{4})[.!](\d{5})')
print(pattern.sub(r'\1-\2-\3', text))  # '064-1000-20000'

text = "064.1000!20000"
pattern = re.compile(r'(\d{3})[.!](\d{4})[.!](\d{5})')
print(pattern.sub(r'\1-\2-\3', text))  # '064-1000-20000'

re.sub 메서드는 다음과 같은 형식으로 사용할 수 있습니다.

re.sub(pattern, repl, string, count=0, flags=0)

re.sub(pattern, repl, string, count=0, flags=0)

2.2.2 split 메서드

text = "hello,world;hi"
result = re.split(r'[,;]', text)  # ['hello', 'world', 'hi']

text = "hello,world;hi"
result = re.split(r'[,;]', text)  # ['hello', 'world', 'hi']

3. 고급 기능

각 패턴에 이름을 부여하여 추출할 때 사용할 수 있습니다. 이름을 부여하는 방법은 (?P<name>...) 형식으로 사용하면 됩니다.

pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
m = pattern.match("2024-11-02")
if m:
    print(m.group('year'))   # 2024
    print(m.group('month'))  # 11
    print(m.group('day'))    # 02

pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
m = pattern.match("2024-11-02")
if m:
    print(m.group('year'))   # 2024
    print(m.group('month'))  # 11
    print(m.group('day'))    # 02

4. 실습 문제

4.1 기본 실습

# 실습 1: URL 추출
text = """
https://github.com/LiveCoronaDetector/livecod
https://github.com/Weniv
https://weniv.co.kr
hello@gmail.com
hello.co.kr
https://weniv
"""
 
# 실습 2: 괄호 내용 추출
text = "[(name, leehojun), (age, 10), (height, 180)]"
 
# 실습 3: 한글 단어 추출
text = "안녕하세요 hello 반갑습니다 world"

# 실습 1: URL 추출
text = """
https://github.com/LiveCoronaDetector/livecod
https://github.com/Weniv
https://weniv.co.kr
hello@gmail.com
hello.co.kr
https://weniv
"""
 
# 실습 2: 괄호 내용 추출
text = "[(name, leehojun), (age, 10), (height, 180)]"
 
# 실습 3: 한글 단어 추출
text = "안녕하세요 hello 반갑습니다 world"

주어진 텍스트에서 모든 URL을 추출하세요.
괄호로 둘러싸인 내용을 추출하세요.
한글로 된 단어만 추출하세요.

4.2 문제 풀이

# 실습 1: URL 추출
text = """
https://github.com/LiveCoronaDetector/livecod
https://github.com/Weniv
https://weniv.co.kr
hello@gmail.com
hello.co.kr
https://weniv
"""
pattern = re.compile(r'https?://[^\s]+|(?:[\w-]+\.)+[\w-]+/\S+')
print(pattern.findall(text))
 
# 실습 2: 괄호 내용 추출
text = "[(name, leehojun), (age, 10), (height, 180)]"
pattern = re.compile(r'\(([^)]+)\)')
print(pattern.findall(text))
 
# 실습 3: 한글 단어 추출
text = "안녕하세요 hello 반갑습니다 world"
pattern = re.compile(r'[가-힣]+')
print(pattern.findall(text))

# 실습 1: URL 추출
text = """
https://github.com/LiveCoronaDetector/livecod
https://github.com/Weniv
https://weniv.co.kr
hello@gmail.com
hello.co.kr
https://weniv
"""
pattern = re.compile(r'https?://[^\s]+|(?:[\w-]+\.)+[\w-]+/\S+')
print(pattern.findall(text))
 
# 실습 2: 괄호 내용 추출
text = "[(name, leehojun), (age, 10), (height, 180)]"
pattern = re.compile(r'\(([^)]+)\)')
print(pattern.findall(text))
 
# 실습 3: 한글 단어 추출
text = "안녕하세요 hello 반갑습니다 world"
pattern = re.compile(r'[가-힣]+')
print(pattern.findall(text))