Commit 1c0f204

DOC: extract more shared includes for comparison pages
1 parent 6938550 commit 1c0f204

12 files changed, 129 additions and 219 deletions

doc/source/getting_started/comparison/comparison_with_sas.rst

Lines changed: 11 additions & 108 deletions
@@ -342,15 +342,7 @@ you supply as the second argument.
 put(FINDW(sex,'ale'));
 run;
 
-Python determines the position of a character in a string with the
-``find`` function. ``find`` searches for the first position of the
-substring. If the substring is found, the function returns its
-position. Keep in mind that Python indexes are zero-based and
-the function will return -1 if it fails to find the substring.
-
-.. ipython:: python
-
-   tips["sex"].str.find("ale").head()
+.. include:: includes/find_substring.rst
 
 
 Extracting substring by position
@@ -366,13 +358,7 @@ SAS extracts a substring from a string based on its position with the
 put(substr(sex,1,1));
 run;
 
-With pandas you can use ``[]`` notation to extract a substring
-from a string by position locations. Keep in mind that Python
-indexes are zero-based.
-
-.. ipython:: python
-
-   tips["sex"].str[0:1].head()
+.. include:: includes/extract_substring.rst
 
 
 Extracting nth word
@@ -394,16 +380,7 @@ second argument specifies which word you want to extract.
 ;;;
 run;
 
-Python extracts a substring from a string based on its text
-by using regular expressions. There are much more powerful
-approaches, but this just shows a simple approach.
-
-.. ipython:: python
-
-   firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
-   firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]
-   firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0]
-   firstlast
+.. include:: includes/nth_word.rst
 
 
 Changing case
@@ -427,27 +404,13 @@ functions change the case of the argument.
 ;;;
 run;
 
-The equivalent Python functions are ``upper``, ``lower``, and ``title``.
+.. include:: includes/case.rst
 
-.. ipython:: python
-
-   firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
-   firstlast["string_up"] = firstlast["String"].str.upper()
-   firstlast["string_low"] = firstlast["String"].str.lower()
-   firstlast["string_prop"] = firstlast["String"].str.title()
-   firstlast
 
 Merging
 -------
 
-The following tables will be used in the merge examples
-
-.. ipython:: python
-
-   df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
-   df1
-   df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
-   df2
+.. include:: includes/merge_setup.rst
 
 In SAS, data must be explicitly sorted before merging. Different
 types of joins are accomplished using the ``in=`` dummy
@@ -473,39 +436,13 @@ input frames.
 if a or b then output outer_join;
 run;
 
-pandas DataFrames have a :meth:`~DataFrame.merge` method, which provides
-similar functionality. Note that the data does not have
-to be sorted ahead of time, and different join
-types are accomplished via the ``how`` keyword.
-
-.. ipython:: python
-
-   inner_join = df1.merge(df2, on=["key"], how="inner")
-   inner_join
-
-   left_join = df1.merge(df2, on=["key"], how="left")
-   left_join
-
-   right_join = df1.merge(df2, on=["key"], how="right")
-   right_join
-
-   outer_join = df1.merge(df2, on=["key"], how="outer")
-   outer_join
+.. include:: includes/merge.rst
 
 
 Missing data
 ------------
 
-Like SAS, pandas has a representation for missing data - which is the
-special float value ``NaN`` (not a number). Many of the semantics
-are the same, for example missing data propagates through numeric
-operations, and is ignored by default for aggregations.
-
-.. ipython:: python
-
-   outer_join
-   outer_join["value_x"] + outer_join["value_y"]
-   outer_join["value_x"].sum()
+.. include:: includes/missing_intro.rst
 
 One difference is that missing data cannot be compared to its sentinel value.
 For example, in SAS you could do this to filter missing values.
@@ -522,25 +459,7 @@ For example, in SAS you could do this to filter missing values.
 if value_x ^= .;
 run;
 
-Which doesn't work in pandas. Instead, the ``pd.isna`` or ``pd.notna`` functions
-should be used for comparisons.
-
-.. ipython:: python
-
-   outer_join[pd.isna(outer_join["value_x"])]
-   outer_join[pd.notna(outer_join["value_x"])]
-
-pandas also provides a variety of methods to work with missing data - some of
-which would be challenging to express in SAS. For example, there are methods to
-drop all rows with any missing values, replacing missing values with a specified
-value, like the mean, or forward filling from previous rows. See the
-:ref:`missing data documentation<missing_data>` for more.
-
-.. ipython:: python
-
-   outer_join.dropna()
-   outer_join.fillna(method="ffill")
-   outer_join["value_x"].fillna(outer_join["value_x"].mean())
+.. include:: includes/missing.rst
 
 
 GroupBy
@@ -549,7 +468,7 @@ GroupBy
 Aggregation
 ~~~~~~~~~~~
 
-SAS's PROC SUMMARY can be used to group by one or
+SAS's ``PROC SUMMARY`` can be used to group by one or
 more key variables and compute aggregations on
 numeric columns.
 
@@ -561,14 +480,7 @@ numeric columns.
 output out=tips_summed sum=;
 run;
 
-pandas provides a flexible ``groupby`` mechanism that
-allows similar aggregations. See the :ref:`groupby documentation<groupby>`
-for more details and examples.
-
-.. ipython:: python
-
-   tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
-   tips_summed.head()
+.. include:: includes/groupby.rst
 
 
 Transformation
@@ -597,16 +509,7 @@ example, to subtract the mean for each observation by smoker group.
 if a and b;
 run;
 
-
-pandas ``groupby`` provides a ``transform`` mechanism that allows
-these type of operations to be succinctly expressed in one
-operation.
-
-.. ipython:: python
-
-   gb = tips.groupby("smoker")["total_bill"]
-   tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")
-   tips.head()
+.. include:: includes/transform.rst
 
 
 By group processing

doc/source/getting_started/comparison/comparison_with_stata.rst

Lines changed: 10 additions & 111 deletions
@@ -311,15 +311,7 @@ first position of the substring you supply as the second argument.
 
 generate str_position = strpos(sex, "ale")
 
-Python determines the position of a character in a string with the
-:func:`find` function. ``find`` searches for the first position of the
-substring. If the substring is found, the function returns its
-position. Keep in mind that Python indexes are zero-based and
-the function will return -1 if it fails to find the substring.
-
-.. ipython:: python
-
-   tips["sex"].str.find("ale").head()
+.. include:: includes/find_substring.rst
 
 
 Extracting substring by position
@@ -331,13 +323,7 @@ Stata extracts a substring from a string based on its position with the :func:`s
 
 generate short_sex = substr(sex, 1, 1)
 
-With pandas you can use ``[]`` notation to extract a substring
-from a string by position locations. Keep in mind that Python
-indexes are zero-based.
-
-.. ipython:: python
-
-   tips["sex"].str[0:1].head()
+.. include:: includes/extract_substring.rst
 
 
 Extracting nth word
@@ -358,16 +344,7 @@ second argument specifies which word you want to extract.
 generate first_name = word(name, 1)
 generate last_name = word(name, -1)
 
-Python extracts a substring from a string based on its text
-by using regular expressions. There are much more powerful
-approaches, but this just shows a simple approach.
-
-.. ipython:: python
-
-   firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]})
-   firstlast["First_Name"] = firstlast["string"].str.split(" ", expand=True)[0]
-   firstlast["Last_Name"] = firstlast["string"].str.rsplit(" ", expand=True)[0]
-   firstlast
+.. include:: includes/nth_word.rst
 
 
 Changing case
@@ -390,27 +367,13 @@ change the case of ASCII and Unicode strings, respectively.
 generate title = strproper(string)
 list
 
-The equivalent Python functions are ``upper``, ``lower``, and ``title``.
-
-.. ipython:: python
+.. include:: includes/case.rst
 
-   firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]})
-   firstlast["upper"] = firstlast["string"].str.upper()
-   firstlast["lower"] = firstlast["string"].str.lower()
-   firstlast["title"] = firstlast["string"].str.title()
-   firstlast
 
 Merging
 -------
 
-The following tables will be used in the merge examples
-
-.. ipython:: python
-
-   df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
-   df1
-   df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
-   df2
+.. include:: includes/merge_setup.rst
 
 In Stata, to perform a merge, one data set must be in memory
 and the other must be referenced as a file name on disk. In
@@ -465,38 +428,13 @@ or the intersection of the two by using the values created in the
 restore
 merge 1:n key using df2.dta
 
-pandas DataFrames have a :meth:`DataFrame.merge` method, which provides
-similar functionality. Note that different join
-types are accomplished via the ``how`` keyword.
-
-.. ipython:: python
-
-   inner_join = df1.merge(df2, on=["key"], how="inner")
-   inner_join
-
-   left_join = df1.merge(df2, on=["key"], how="left")
-   left_join
-
-   right_join = df1.merge(df2, on=["key"], how="right")
-   right_join
-
-   outer_join = df1.merge(df2, on=["key"], how="outer")
-   outer_join
+.. include:: includes/merge_setup.rst
 
 
 Missing data
 ------------
 
-Like Stata, pandas has a representation for missing data -- the
-special float value ``NaN`` (not a number). Many of the semantics
-are the same; for example missing data propagates through numeric
-operations, and is ignored by default for aggregations.
-
-.. ipython:: python
-
-   outer_join
-   outer_join["value_x"] + outer_join["value_y"]
-   outer_join["value_x"].sum()
+.. include:: includes/missing_intro.rst
 
 One difference is that missing data cannot be compared to its sentinel value.
 For example, in Stata you could do this to filter missing values.
@@ -508,30 +446,7 @@ For example, in Stata you could do this to filter missing values.
 * Keep non-missing values
 list if value_x != .
 
-This doesn't work in pandas. Instead, the :func:`pd.isna` or :func:`pd.notna` functions
-should be used for comparisons.
-
-.. ipython:: python
-
-   outer_join[pd.isna(outer_join["value_x"])]
-   outer_join[pd.notna(outer_join["value_x"])]
-
-pandas also provides a variety of methods to work with missing data -- some of
-which would be challenging to express in Stata. For example, there are methods to
-drop all rows with any missing values, replacing missing values with a specified
-value, like the mean, or forward filling from previous rows. See the
-:ref:`missing data documentation<missing_data>` for more.
-
-.. ipython:: python
-
-   # Drop rows with any missing value
-   outer_join.dropna()
-
-   # Fill forwards
-   outer_join.fillna(method="ffill")
-
-   # Impute missing values with the mean
-   outer_join["value_x"].fillna(outer_join["value_x"].mean())
+.. include:: includes/missing.rst
 
 
 GroupBy
@@ -548,14 +463,7 @@ numeric columns.
 
 collapse (sum) total_bill tip, by(sex smoker)
 
-pandas provides a flexible ``groupby`` mechanism that
-allows similar aggregations. See the :ref:`groupby documentation<groupby>`
-for more details and examples.
-
-.. ipython:: python
-
-   tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
-   tips_summed.head()
+.. include:: includes/groupby.rst
 
 
 Transformation
@@ -570,16 +478,7 @@ For example, to subtract the mean for each observation by smoker group.
 bysort sex smoker: egen group_bill = mean(total_bill)
 generate adj_total_bill = total_bill - group_bill
 
-
-pandas ``groupby`` provides a ``transform`` mechanism that allows
-these type of operations to be succinctly expressed in one
-operation.
-
-.. ipython:: python
-
-   gb = tips.groupby("smoker")["total_bill"]
-   tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")
-   tips.head()
+.. include:: includes/transform.rst
 
 
 By group processing
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+The equivalent Python functions are ``upper``, ``lower``, and ``title``.
+
+.. ipython:: python
+
+   firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]})
+   firstlast["upper"] = firstlast["string"].str.upper()
+   firstlast["lower"] = firstlast["string"].str.lower()
+   firstlast["title"] = firstlast["string"].str.title()
+   firstlast
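
A standalone sketch of what the shared case include added above computes, for checking it outside the docs build. It assumes only that pandas is importable as `pd`, which the include itself leaves to the host page. The column names are the ones in the added file; they match the removed Stata block, not the SAS page's earlier `String` / `string_up` names.

```python
import pandas as pd

# Same two-row frame as the added include file.
firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]})

# Vectorised case conversion through the .str accessor.
firstlast["upper"] = firstlast["string"].str.upper()
firstlast["lower"] = firstlast["string"].str.lower()
firstlast["title"] = firstlast["string"].str.title()

print(firstlast)
```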
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
1+
With pandas you can use ``[]`` notation to extract a substring
2+
from a string by position locations. Keep in mind that Python
3+
indexes are zero-based.
4+
5+
.. ipython:: python
6+
7+
tips["sex"].str[0:1].head()
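
The zero-based behaviour described in the added text can be tried without the docs' dataset. The sketch below assumes a made-up four-row stand-in for `tips` (the real frame is loaded elsewhere in the comparison pages). It also exercises `Series.str.find`, which the removed find-substring text describes as returning -1 when there is no match.

```python
import pandas as pd

# Hypothetical stand-in for the `tips` dataset; only "sex" is needed here.
tips = pd.DataFrame({"sex": ["Female", "Male", "Male", "Female"]})

# Slice [0:1] is zero-based and end-exclusive, so it keeps the first character.
first_char = tips["sex"].str[0:1]

# find returns the zero-based start of the first match, or -1 if absent.
position = tips["sex"].str.find("ale")
missing = pd.Series(["xyz"]).str.find("ale")

print(first_char.tolist(), position.tolist(), missing.tolist())
```

In SAS and Stata the equivalent `substr(sex, 1, 1)` call uses a one-based start position, which is the difference the include's zero-based reminder is about.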
