
How to Replace a String in Python – Real Python

learningcode_x1mckf by learningcode_x1mckf
September 14, 2022
in Python


If you’re looking for ways to remove or replace all or part of a string in Python, then this tutorial is for you. You’ll take a fictional chat room transcript and sanitize it using both the .replace() method and the re.sub() function.

In Python, the .replace() method and the re.sub() function are often used to clean up text by removing strings or substrings or replacing them. In this tutorial, you’ll play the role of a developer for a company that provides technical support through a one-to-one text chat. You’re tasked with creating a script that’ll sanitize the chat, removing any personal data and replacing any swear words with emoji.

You’re only given one very short chat transcript:

[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY BLASTED ACCOUNT
[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!

Although this transcript is short, it’s typical of the type of chats that agents have all the time. It has user identifiers, ISO time stamps, and messages.

In this case, the client johndoe filed a complaint, and company policy is to sanitize and simplify the transcript, then pass it on for independent evaluation. Sanitizing the message is your job!

The first thing you’ll want to do is to take care of any swear words.

How to Remove or Replace a Python String or Substring

The most basic way to replace a string in Python is to use the .replace() string method:

>>> "Fake Python".replace("Fake", "Real")
'Real Python'

As you can see, you can chain .replace() onto any string and provide the method with two arguments. The first is the string that you want to replace, and the second is the replacement.

Note: Although the Python shell displays the result of .replace(), the string itself stays unchanged. You can see this more clearly by assigning your string to a variable:

>>> name = "Fake Python"
>>> name.replace("Fake", "Real")
'Real Python'

>>> name
'Fake Python'

>>> name = name.replace("Fake", "Real")

>>> name
'Real Python'

Notice that when you simply call .replace(), the value of name doesn’t change. But when you assign the result of name.replace() to the name variable, 'Fake Python' becomes 'Real Python'.
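Since the goal here includes removing text as well as replacing it, a small sketch may help. It assumes nothing beyond the built-in string method: passing an empty string as the replacement removes every occurrence, and the optional third argument limits how many occurrences are replaced. The sample string is made up for illustration and isn't part of the transcript.

```python
# Illustrative string, not taken from the chat transcript
greeting = "spam, spam, spam and eggs"

# Replacing with an empty string removes every occurrence
no_spam = greeting.replace("spam, ", "")

# The optional third argument caps the number of replacements
first_only = greeting.replace("spam", "ham", 1)

print(no_spam)     # spam and eggs
print(first_only)  # ham, spam, spam and eggs
print(greeting)    # The original string is unchanged
```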

Now it’s time to apply this knowledge to the transcript:

>>> transcript = """
... [support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
... [johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY BLASTED ACCOUNT
... [support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
... [johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!"""

>>> print(transcript.replace("BLASTED", "😤"))

[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY 😤 ACCOUNT
[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!

Loading the transcript as a triple-quoted string and then using the .replace() method on one of the swear words works fine. But there’s another swear word that’s not getting replaced because in Python, the string needs to match exactly:

>>> "Fake Python".replace("fake", "Real")
'Fake Python'

As you can see, even if the casing of one letter doesn’t match, it’ll prevent any replacements. This means that if you’re using the .replace() method, you’ll need to call it several times, once for each variation. In this case, you can just chain on another call to .replace():

>>> print(transcript.replace("BLASTED", "😤").replace("Blast", "😤"))

[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY 😤 ACCOUNT
[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
[johndoe] 2022-08-24T10:04:03+00:00 : 😤! You're right!

Success! But you’re probably thinking that this isn’t the best way to do this for something like a general-purpose transcription sanitizer. You’ll want to move toward some way of keeping a list of replacements, instead of having to type out .replace() each time.

Set Up Multiple Replacement Rules

There are a few more replacements that you need to make to the transcript to get it into a format acceptable for independent review:

  • Shorten or remove the time stamps
  • Replace the usernames with Agent and Client

Now that you’re starting to have more strings to replace, chaining on .replace() is going to get repetitive. One idea could be to keep a list of tuples, with two items in each tuple. The two items would correspond to the arguments that you need to pass into the .replace() method, namely the string to replace and the replacement string:

# transcript_multiple_replace.py

REPLACEMENTS = [
    ("BLASTED", "😤"),
    ("Blast", "😤"),
    ("2022-08-24T", ""),
    ("+00:00", ""),
    ("[support_tom]", "Agent "),
    ("[johndoe]", "Client"),
]

transcript = """
[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY BLASTED ACCOUNT
[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!
"""

for old, new in REPLACEMENTS:
    transcript = transcript.replace(old, new)

print(transcript)

In this version of your transcript-cleaning script, you created a list of replacement tuples, which gives you a quick way to add replacements. You could even create this list of tuples from an external CSV file if you had loads of replacements.

You then iterate over the list of replacement tuples. In each iteration, you call .replace() on the string, populating the arguments with the old and new variables that have been unpacked from each replacement tuple.

Note: The unpacking in the for loop in this case is functionally the same as using indexing:

for replacement in REPLACEMENTS:
    transcript = transcript.replace(replacement[0], replacement[1])

If you’re mystified by unpacking, then check out the section on unpacking from the tutorial on Python lists and tuples.
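The idea of loading replacement pairs from an external CSV file could be sketched roughly as follows. This is only an illustration: the CSV content is held in an in-memory io.StringIO object as a stand-in for a real file, and the two-column layout (old string, new string) is an assumption rather than a format the tutorial prescribes.

```python
import csv
import io

# Stand-in for an external file. With a real file you might instead use
# open("replacements.csv", newline="", encoding="utf-8"), where the
# filename is hypothetical.
csv_source = io.StringIO(
    "BLASTED,😤\n"
    "Blast,😤\n"
    "[support_tom],Agent \n"
    "[johndoe],Client\n"
)

# Each CSV row becomes an (old, new) tuple
replacements = [tuple(row) for row in csv.reader(csv_source)]

line = "[johndoe] Blast! You're right!"
for old, new in replacements:
    line = line.replace(old, new)

print(line)  # Client 😤! You're right!
```

Keeping the pairs in a file means that non-programmers could maintain the list, although the order of the rows still matters because the replacements are applied one after another.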

With the replacement list in place, you’ve made a big improvement in the overall readability of the transcript. It’s also easier to add replacements if you need to. Running this script reveals a much cleaner transcript:

$ python transcript_multiple_replace.py
Agent  10:02:23 : What can I help you with?
Client 10:03:15 : I CAN'T CONNECT TO MY 😤 ACCOUNT
Agent  10:03:30 : Are you sure it's not your caps lock?
Client 10:04:03 : 😤! You're right!

That’s a pretty clean transcript. Maybe that’s all you need. But if your inner automator isn’t happy, maybe it’s because there are still some things that may be bugging you:

  • Replacing the swear words won’t work if there’s another variation using -ing or a different capitalization, like BLAst.
  • Removing the date from the time stamp currently only works for August 24, 2022.
  • Removing the full time stamp would involve setting up replacement pairs for every possible time, which isn’t something you’re too keen on doing.
  • Adding the space after Agent in order to line up your columns works but isn’t very general.

If these are your concerns, then you may want to turn your attention to regular expressions.

Leverage re.sub() to Make Complex Rules

Whenever you’re looking to do any replacing that’s slightly more complex or needs some wildcards, you’ll usually want to turn your attention toward regular expressions, also known as regex.

Regex is a sort of mini-language made up of characters that define a pattern. These patterns, or regexes, are typically used to search for strings in find operations and in find-and-replace operations. Many programming languages support regex, and it’s widely used. Regex will even give you superpowers.

In Python, leveraging regex means using the re module’s sub() function and building your own regex patterns:

# transcript_regex.py

import re

REGEX_REPLACEMENTS = [
    (r"blast\w*", "😤"),
    (r" [-T:+\d]{25}", ""),
    (r"\[support\w*\]", "Agent "),
    (r"\[johndoe\]", "Client"),
]

transcript = """
[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY BLASTED ACCOUNT
[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!
"""

for old, new in REGEX_REPLACEMENTS:
    transcript = re.sub(old, new, transcript, flags=re.IGNORECASE)

print(transcript)

While you can mix and match the sub() function with the .replace() method, this example only uses sub(), so you can see how it’s used. You’ll note that you can replace all variations of the swear word by using just one replacement tuple now. Similarly, you’re only using one regex for the full time stamp:

$ python transcript_regex.py
Agent  : What can I help you with?
Client : I CAN'T CONNECT TO MY 😤 ACCOUNT
Agent  : Are you sure it's not your caps lock?
Client : 😤! You're right!

Now your transcript has been completely sanitized, with all noise removed! How did that happen? That’s the magic of regex.

The first regex pattern, "blast\w*", makes use of the \w special character, which will match alphanumeric characters and underscores. Adding the * quantifier directly after it will match zero or more characters of \w.

Another vital part of the first pattern is that the re.IGNORECASE flag makes it a case-insensitive pattern. So now, any substring containing blast, regardless of capitalization, will be matched and replaced.

Note: The "blast\w*" pattern is quite broad and will also modify fibroblast to fibro😤. It also can’t identify a polite use of the word. It just matches the characters. That said, the typical swear words that you’d want to censor don’t really have polite alternate meanings!
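To make the effect of the flag and the broadness of the pattern concrete, here is a small sketch. The sample sentence is invented, and the \b word-boundary variant is an assumption about how you might narrow the pattern; it isn't part of the tutorial's script.

```python
import re

# Invented sample text for illustration
text = "The fibroblast sample is BLASTED late. Blast!"

# Without the flag, only the lowercase occurrence matches
case_sensitive = re.sub(r"blast\w*", "😤", text)

# With re.IGNORECASE, every capitalization matches, including fibroblast
broad = re.sub(r"blast\w*", "😤", text, flags=re.IGNORECASE)

# Adding \b requires the match to start at a word boundary
bounded = re.sub(r"\bblast\w*", "😤", text, flags=re.IGNORECASE)

print(case_sensitive)  # The fibro😤 sample is BLASTED late. Blast!
print(broad)           # The fibro😤 sample is 😤 late. 😤!
print(bounded)         # The fibroblast sample is 😤 late. 😤!
```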

The second regex pattern uses character sets and quantifiers to replace the time stamp. You often use character sets and quantifiers together. A regex pattern of [abc], for example, will match one character of a, b, or c. Putting a * directly after it would match zero or more characters of a, b, or c.

There are more quantifiers, though. If you used [abc]{10}, it would match exactly ten characters of a, b, or c in any order and any combination. Also note that repeating characters is redundant, so [aa] is equivalent to [a].
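A quick way to get a feel for character sets and quantifiers is to try them on a throwaway string. The snippet below is a minimal sketch with made-up input, using a {3} quantifier instead of {10} to keep the example short.

```python
import re

# Exactly three characters drawn from a, b, or c, in any order
triples = re.findall(r"[abc]{3}", "abccba xyz cab")
print(triples)  # ['abc', 'cba', 'cab']

# The * quantifier also accepts zero characters, so an empty string matches
empty_match = re.fullmatch(r"[abc]*", "")
print(empty_match is not None)  # True

# Repeating a character inside a set doesn't change what it matches
same = re.findall(r"[aa]", "banana") == re.findall(r"[a]", "banana")
print(same)  # True
```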

For the time stamp, you use an extended character set of [-T:+\d] to match all the possible characters that you might find in the time stamp. Paired with the quantifier {25}, this will match any possible time stamp, at least until the year 10,000.

Note: The special character, \d, matches any digit character.

The time stamp regex pattern allows you to select any possible date in the time stamp format. Seeing as the times aren’t important for the independent reviewer of these transcripts, you replace them with an empty string. It’s possible to write a more advanced regex that preserves the time information while removing the date.
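One possible shape for such a more advanced regex is sketched below. It is an assumption about how you might approach it, not the tutorial's own solution: the pattern spells out the ISO layout, wraps the time of day in a capture group, and uses the \1 backreference in the replacement to keep only that group. The TIME_STAMP name is hypothetical.

```python
import re

# Hypothetical pattern: date, T, captured time of day, then the UTC offset
TIME_STAMP = r"\d{4}-\d{2}-\d{2}T(\d{2}:\d{2}:\d{2})[+-]\d{2}:\d{2}"

entry = "[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT"

# \1 in the replacement refers to the first capture group
cleaned = re.sub(TIME_STAMP, r"\1", entry)

print(cleaned)  # [johndoe] 10:03:15 : I CAN'T CONNECT
```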

The third regex pattern is used to select any user string that starts with the keyword "support". Note that you escape (\) the square bracket ([) because otherwise the keyword would be interpreted as a character set.

Finally, the last regex pattern selects the client username string and replaces it with "Client".

Note: While it would be great fun to go into more detail about these regex patterns, this tutorial isn’t about regex. Work through the Python regex tutorial for a good primer on the subject. Also, you can make use of the fantastic RegExr web site, because regex is tricky and regex wizards of all levels rely on handy tools like RegExr.

RegExr is particularly good because you can copy and paste regex patterns, and it’ll break them down for you with explanations.

With regex, you can drastically cut down the number of replacements that you have to write out. That said, you still may have to come up with many patterns. Seeing as regex isn’t the most readable of languages, having lots of patterns can quickly become hard to maintain.

Thankfully, there’s a neat trick with re.sub() that allows you to have a bit more control over how replacement works, and it offers a much more maintainable architecture.

Use a Callback With re.sub() for Even More Control

One trick that Python and sub() have up their sleeves is that you can pass in a callback function instead of the replacement string. This gives you total control over how to match and replace.

To get started building this version of the transcript-sanitizing script, you’ll use a basic regex pattern to see how using a callback with sub() works:

# transcript_regex_callback.py

import re

transcript = """
[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY BLASTED ACCOUNT
[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!
"""

def sanitize_message(match):
    print(match)

re.sub(r"[-T:+\d]{25}", sanitize_message, transcript)

The regex pattern that you’re using will match the time stamps, and instead of providing a replacement string, you’re passing in a reference to the sanitize_message() function. Now, when sub() finds a match, it’ll call sanitize_message() with a match object as an argument.

Since sanitize_message() just prints the object that it’s received as an argument, when running this, you’ll see the match objects being printed to the console:

$ python transcript_regex_callback.py
<re.Match object; span=(15, 40), match='2022-08-24T10:02:23+00:00'>
<re.Match object; span=(79, 104), match='2022-08-24T10:03:15+00:00'>
<re.Match object; span=(159, 184), match='2022-08-24T10:03:30+00:00'>
<re.Match object; span=(235, 260), match='2022-08-24T10:04:03+00:00'>

A match object is one of the building blocks of the re module. The more basic re.match() function returns a match object. sub() doesn’t return any match objects but uses them behind the scenes.

Because you get this match object in the callback, you can use any of the information contained within it to build the replacement string. Once it’s built, you return the new string, and sub() will replace the match with the returned string.
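As a minimal sketch of a callback that actually returns a replacement, the hypothetical shorten_time_stamp() function below slices the time of day out of the matched text. The slice positions assume the fixed 25-character ISO layout used in this transcript.

```python
import re

def shorten_time_stamp(match):
    # match.group() is the full matched text,
    # for example '2022-08-24T10:02:23+00:00'
    return match.group()[11:19]

entry = "[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?"

# sub() calls shorten_time_stamp() for each match and inserts its return value
result = re.sub(r"[-T:+\d]{25}", shorten_time_stamp, entry)

print(result)  # [support_tom] 10:02:23 : What can I help you with?
```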

Apply the Callback to the Script

In your transcript-sanitizing script, you’ll make use of the .groups() method of the match object to return the contents of the two capture groups, and then you can sanitize each part in its own function or discard it:

# transcript_regex_callback.py

import re

ENTRY_PATTERN = (
    r"\[(.+)\] "  # User string, discarding square brackets
    r"[-T:+\d]{25} "  # Time stamp
    r": "  # Separator
    r"(.+)"  # Message
)
BAD_WORDS = ["blast", "dash", "beezlebub"]
CLIENTS = ["johndoe", "janedoe"]

def censor_bad_words(message):
    for word in BAD_WORDS:
        message = re.sub(rf"{word}\w*", "😤", message, flags=re.IGNORECASE)
    return message

def censor_users(user):
    if user.startswith("support"):
        return "Agent"
    elif user in CLIENTS:
        return "Client"
    else:
        raise ValueError(f"unknown client: '{user}'")

def sanitize_message(match):
    user, message = match.groups()
    return f"{censor_users(user):<6} : {censor_bad_words(message)}"

transcript = """
[support_tom] 2022-08-24T10:02:23+00:00 : What can I help you with?
[johndoe] 2022-08-24T10:03:15+00:00 : I CAN'T CONNECT TO MY BLASTED ACCOUNT
[support_tom] 2022-08-24T10:03:30+00:00 : Are you sure it's not your caps lock?
[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!
"""

print(re.sub(ENTRY_PATTERN, sanitize_message, transcript))

Instead of having lots of different regexes, you can have one top-level regex that can match the whole line, dividing it up into capture groups with brackets (()). The capture groups have no effect on the actual matching process, but they do affect the match object that results from the match:

  • \[(.+)\] matches any sequence of characters wrapped in square brackets. The capture group picks out the username string, for instance johndoe.
  • [-T:+\d]{25} matches the time stamp, which you explored in the last section. Since you won’t be using the time stamp in the final transcript, it’s not captured with brackets.
  • : matches a literal colon. The colon is used as a separator between the message metadata and the message itself.
  • (.+) matches any sequence of characters until the end of the line, which will be the message.

The content of the capturing groups will be available as separate items in the match object by calling the .groups() method, which returns a tuple of the matched strings.
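To see what .groups() hands back, you could run the pattern against a single entry with re.search(). This is only an illustrative sketch: the pattern is written here as one plain string, and the entry is one line from the transcript.

```python
import re

# The entry pattern written out as a single raw string
ENTRY_PATTERN = r"\[(.+)\] [-T:+\d]{25} : (.+)"

entry = "[johndoe] 2022-08-24T10:04:03+00:00 : Blast! You're right!"
match = re.search(ENTRY_PATTERN, entry)

# One tuple item per capture group, in order of appearance
print(match.groups())  # ('johndoe', "Blast! You're right!")

# Unpacking assigns each captured string to its own variable
user, message = match.groups()
print(user)     # johndoe
print(message)  # Blast! You're right!
```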

Note: The entry regex definition uses Python’s implicit string concatenation:

ENTRY_PATTERN = (
    r"\[(.+)\] "  # User string, discarding square brackets
    r"[-T:+\d]{25} "  # Time stamp
    r": "  # Separator
    r"(.+)"  # Message
)

Functionally, this is the same as writing it all out as one single string: r"\[(.+)\] [-T:+\d]{25} : (.+)". Organizing your longer regex patterns on separate lines allows you to break them up into chunks, which not only makes them more readable but also allows you to insert comments.

The two groups are the user string and the message. The .groups() method returns them as a tuple of strings. In the sanitize_message() function, you first use unpacking to assign the two strings to variables:

def sanitize_message(match):
    user, message = match.groups()
    return f"{censor_users(user):<6} : {censor_bad_words(message)}"

Note how this architecture allows a very broad and inclusive regex at the top level, and then lets you supplement it with more precise regexes within the replacement callback.

The sanitize_message() function uses two functions to clean up usernames and bad words. It additionally uses f-strings to justify the messages. Note how censor_bad_words() uses a dynamically created regex while censor_users() relies on more basic string processing.
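If the list of bad words grows, one variation you might consider is building a single alternation pattern up front instead of looping. The sketch below is an assumption, not the tutorial's approach: it uses re.escape() so that any regex metacharacters in a word are treated literally, and the BAD_WORDS_PATTERN name is hypothetical.

```python
import re

BAD_WORDS = ["blast", "dash", "beezlebub"]

# Escape each word, then join them into one alternation: blast|dash|beezlebub
alternation = "|".join(re.escape(word) for word in BAD_WORDS)

# Hypothetical precompiled pattern that covers every bad word in one pass
BAD_WORDS_PATTERN = re.compile(rf"(?:{alternation})\w*", flags=re.IGNORECASE)

def censor_bad_words(message):
    return BAD_WORDS_PATTERN.sub("😤", message)

print(censor_bad_words("Blast! I DASHED off, you beezlebub."))
# 😤! I 😤 off, you 😤.
```

Compiling once avoids rebuilding a pattern for every word on every message, at the cost of a slightly denser regex.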

This is now looking like a good first prototype for a transcript-sanitizing script! The output is squeaky clean:

$ python transcript_regex_callback.py
Agent  : What can I help you with?
Client : I CAN'T CONNECT TO MY 😤 ACCOUNT
Agent  : Are you sure it's not your caps lock?
Client : 😤! You're right!

Great! Using sub() with a callback gives you far more flexibility to mix and match different methods and build regexes dynamically. This structure also gives you the most room to grow when your bosses or clients inevitably change their requirements on you!

Summary

In this tutorial, you’ve learned how to replace strings in Python. Along the way, you’ve gone from using the basic Python .replace() string method to using callbacks with re.sub() for absolute control. You’ve also explored some regex patterns and deconstructed them into a better architecture to manage a replacement script.

With all that knowledge, you’ve successfully cleaned a chat transcript, which is now ready for independent review. Not only that, but your transcript-sanitizing script has plenty of room to grow.




